Brain Phrye


The Unix Way

I was reminded of this article, which is a nice little example of how to use Unix shell tools to solve problems faster than if you were to “write a program.”

You’re still coding in shell, it’s just that each “line” is a pretty powerful function. Plus each function has a pretty simple interface: it takes a text stream (standard input) and some positional or named arguments, and it produces two text streams (standard output, which is usually “the output,” and standard error, which is usually an “out of band” stream) and an integer (the exit status).
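To make that interface concrete, here is a minimal sketch of a shell function that follows it. The name `count_matches` and its usage message are made up for illustration; only `grep -c`, `printf` and ordinary shell redirection are assumed.

```shell
# Hypothetical function following the interface described above:
# text on stdin, one positional argument, results on stdout,
# diagnostics on stderr, and an integer exit status.
count_matches() {
    pattern=$1
    if [ -z "$pattern" ]; then
        echo "usage: count_matches PATTERN" >&2   # the "out of band" stream
        return 2                                  # the integer
    fi
    grep -c -- "$pattern"                         # "the output": a count on stdout
}

# Feed a text stream in, then capture the output and the integer separately.
out=$(printf 'apple\nbanana\navocado\n' | count_matches '^a' 2>/dev/null)
status=$?
echo "stdout=$out status=$status"   # prints: stdout=2 status=0
```

Because the diagnostics go to a separate stream, a caller can pipe the output into the next command without the error text getting mixed into the data.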

And within these constraints, you can write really powerful programs. Sometimes ridiculously more powerful.

Looking at McIlroy’s code there’s another thing to note:

tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

The commands leading up to each sort can run in parallel, each consuming the output of the previous pipeline step as it is created. The sorts are bottlenecks: the data builds up there until all previous pipeline steps complete, and only then does the sort begin spewing output.

So in the example above, the two tr commands and the first sort start out running in parallel. Once the tr commands exit, the first sort sorts the data and then starts to output. At that point the first sort, the uniq and the second sort run in parallel. Once the first sort and the uniq complete, the second sort sorts its input, and once it starts to output, it and the sed run in parallel. The sed quits after printing the requested number of lines, which (in all likelihood) causes the second sort to exit before it finishes its output.
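One way to see the streaming stages at work is to feed a pipeline from a producer that never finishes on its own. The sketch below is an illustration, not part of McIlroy's program; it assumes a POSIX shell and the standard `yes`, `tr` and `sed` utilities.

```shell
# `yes` prints its argument forever, so this pipeline can only finish
# if the stages stream: tr passes data along as it arrives, and sed
# quits after the third line, closing the pipe behind it. The upstream
# commands are then stopped when they next try to write to that pipe.
first_three=$(yes Word | tr A-Z a-z | sed 3q)
printf '%s\n' "$first_three"

# A sort in the middle would change this: sort has to see the end of
# its input before it can emit its first line, so the following would
# print nothing while `yes` keeps producing input.
#   yes Word | sort | sed 3q    # do not run
```

The same early exit is what lets the final sed in the word-count pipeline cut the second sort short once the requested number of lines has been printed.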

The shell-v-hadoop article above goes into way more detail on the built-in, implicit parallelization of shell. It’s an impressive feature of the language, and one that can bring surprising performance benefits if used correctly.