Brain Phrye

Doing sets in shell

Sometimes I need to take two newline-delimited lists and do set operations on them. These are generally the output of commands.

You can do unions, intersections and differences, and in this post I’m going to explore the last of these: the difference, specifically A - B.

Say you have three directories, A, B and C and you want all the files in A that aren’t in B copied into C. To do this, you can do the following:

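{
  (cd A && find .);   # everything in A, once
  (cd B && find .);   # everything in B...
  (cd B && find .)    # ...twice, so no line from B is ever unique
} | sort | uniq -u | (cd A && cpio -pd ../C)   # -d creates missing directories under C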

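This works fine, but the duplication of the find is kind of annoying. Without it you’d end up with the symmetric difference, A ⊖ B (the bits that A and B don’t have in common), not the difference. Listing B twice means every line from B appears at least twice, so uniq -u, which only prints lines that occur exactly once, drops all of them.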
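To see the effect of that second copy, here's a small sketch using two made-up lists in place of the find output. The function names a_list and b_list and their contents are mine, purely for illustration, and it assumes neither list contains duplicates of its own:

```shell
# Hypothetical stand-ins for the two listings: A has x, y, z; B has y, z, w.
a_list() { printf '%s\n' x y z; }
b_list() { printf '%s\n' y z w; }

# B listed once: any line that appears in exactly one list survives,
# which is the symmetric difference.
sym_diff=$({ a_list; b_list; } | sort | uniq -u)

# B listed twice: every line from B now occurs at least twice, so
# uniq -u drops it and only the lines unique to A remain.
diff_ab=$({ a_list; b_list; b_list; } | sort | uniq -u)

echo "symmetric difference:"
echo "$sym_diff"
echo "A - B:"
echo "$diff_ab"
```

With these lists, the first pipeline keeps w and x while the second keeps only x.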

But can it be done without running the command twice? For this find it’s probably not too bad, but some commands are more compute/IO intensive.

The answer, as it often is in shell, is sed:

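{
  (cd A && find .);          # everything in A, once
  (cd B && find . | sed p)   # everything in B, each line printed twice
} | sort | uniq -u | (cd A && cpio -pd ../C)   # -d creates missing directories under C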

This duplicates each line from B: by default sed prints every line it reads, and the p command prints it one more time.
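A quick way to check the doubling in isolation is to feed sed a few lines by hand. The three-line input here is invented for the example; the contrast with -n (which turns off the default print) is just there to show where the second copy comes from:

```shell
# Hypothetical three-line input standing in for a command's output.
doubled=$(printf '%s\n' y z w | sed p)

# With -n the automatic print is suppressed, so p yields each line once.
once=$(printf '%s\n' y z w | sed -n p)

echo "sed p:"
echo "$doubled"
echo "sed -n p:"
echo "$once"
```

The first form emits every line twice in a row, so after sort the copies sit next to each other and uniq -u discards them.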