
GNU Parallel

GNU parallel is a shell tool for executing jobs in parallel using one or more computers.

For example, if I want to write a multicore version of wc, I could do:

cat XXX | parallel --block 10M --pipe wc -l | awk 'BEGIN{count=0;} {count = count + $1;} END{print count;}'

My question is: how do I do sorting using parallel? I know I should pipe the output of parallel into a "merge sorted files" command (just like the final merge in merge sort), but I don't know how to do that.

qqibrow
  • Have you read: http://www.gnu.org/software/parallel/man.html#EXAMPLE:-Processing-a-big-file-using-more-cores – Ole Tange Jan 16 '15 at 00:54
  • @OleTange This brings us back to the no. 1 principle: RTFM. Yeah, I should read that. Thanks. – qqibrow Jan 16 '15 at 03:06

1 Answer


There are a few ways to do this.

Let's get a simple text file to play with:

$ curl http://www.gutenberg.org/cache/epub/2701/pg2701.txt 2>/dev/null |
   tr " " "\n" | tr "[A-Z]" "[a-z]" | 
   sed -e 's/[[:punct:]]*//g' -e '/^[[:space:]]*$/d' > moby-dick-words.txt

$ wc -l moby-dick-words.txt

215117 moby-dick-words.txt
$ time sort moby-dick-words.txt > moby-dick-words-sorted.txt

real    0m0.260s
user    0m0.462s
sys 0m0.004s

We can do the sorting on chunks of the text, say 1000 lines at a time, and defer some of the hard, serial work to the merging (sort -m) part:

$ mkdir tmp
$ time (
  cd tmp;
  split -l 1000 ../moby-dick-words.txt;
  parallel sort {} -o {}.sorted ::: x*;
  sort -m *.sorted > ../moby-dick-words-sorted-merge.txt;
  rm x* )

real    0m0.787s
user    0m0.495s
sys 0m0.103s

$ diff moby-dick-words-sorted.txt moby-dick-words-sorted-merge.txt 

$ uniq -c moby-dick-words-sorted-merge.txt | tail
  1 zeuglodon
  1 zigzag
  5 zodiac
  1 zogranda
  4 zone
  1 zone
  2 zoned
  3 zones
  2 zoology
  1 zoroaster

So this splits the text into sequential 1000-line chunks, uses parallel to sort each chunk, and then uses sort -m to merge the sorted chunks into a complete sort.
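If you would rather not stage the chunks on disk with split, the same chunk-sort-merge idea can be written as one pipeline using parallel's --pipe and --files options, much like the big-file example in the GNU parallel manual linked in the comments above. A minimal sketch (untested here; the output filename is just illustrative):

$ cat moby-dick-words.txt |
   parallel --pipe --files sort |
   parallel -Xj1 sort -m {} ';' rm {} > moby-dick-words-sorted-pipe.txt

Here --files makes each job write its sorted chunk to a temporary file and print the filename, and the second parallel invocation merges those files with sort -m and then removes them.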

The next approach would be to do the hard work at the split stage, rather than the merge stage, so that the partial results can be merged together by a simple cat:

  $ rm tmp/*
  $ letters="a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9"
  $ time (
    cd tmp; 
    parallel sed -e "/^{}/w{}.txt" ../moby-dick-words.txt ::: $letters >& /dev/null;
    parallel sort {}.txt -o {}.sorted.txt ::: $letters;  
    cat *.sorted.txt > ../moby-dick-words-sorted-split.txt;
    rm *.txt )

  real  0m1.015s
  user  0m2.355s
  sys   0m0.510s
  $ diff moby-dick-words-sorted-split.txt moby-dick-words-sorted.txt
  $ uniq -c moby-dick-words-sorted-split.txt | tail
  1 zeuglodon
  1 zigzag
  5 zodiac
  1 zogranda
  4 zone
  1 zone
  2 zoned
  3 zones
  2 zoology
  1 zoroaster

Here we (in parallel) split the file by the first character of each line, sort those files individually, and then the merge is a simple concatenation.

Note that this is really for entertainment/educational purposes only; later versions of GNU sort have parallelism built in (look at the --parallel option), which will do a much better job than this. And a slicker version of the merge approach can be seen in this answer.
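For comparison, the built-in route is a one-liner; --parallel is a standard GNU sort option, though the thread count here is just an example:

$ sort --parallel=4 moby-dick-words.txt > moby-dick-words-sorted.txt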

Jonathan Dursi
  • I realize this is a bit of a forced example, but the performance is getting worse as complexity increases. What are we gaining? – glenn jackman Jan 15 '15 at 19:48
  • 1
    @glennjackman Not a thing! In these cases, the extra file I/O completely dominates any benefit from the multiple cores. One could do a better job performance-wise by limiting the number of jobs launched, using pipes instead of the filesystem, and working on larger data, but really, doing sorting with parallel is only useful (IMHO) as an example of a general technique. The right way to get multi-core sorting would be to just use `sort --parallel`. – Jonathan Dursi Jan 15 '15 at 19:52
  • @JonathanDursi Fantastic explanation! I understand this is for entertainment/educational purposes only. But is it possible that using parallel on a cluster could compete with map-reduce in other languages? And if not, when should we use GNU parallel? – qqibrow Jan 15 '15 at 20:40
  • 1
    @qqibrow That's where things could get a little more interesting - using (say) gnu parallel to distribute things across nodes on a cluster, and sort --parallel to do the multicore sorting within a node. That way you'd have access to more RAM to do the sorting in, which should improve performance (although note that `sort`'s external-memory sorting is quite performant). Once one is going across nodes, there's no real reason gnu parallel would be any worse a way to farm out jobs than any other job-farming approach; but not all use-cases map nicely to gnu parallel. – Jonathan Dursi Jan 15 '15 at 21:00
  • 1
    Avoid using `split` first. It will kill your performance. Instead use `--pipepart`: that way you will only be reading the file once, and you will have multiple CPUs doing it. – Ole Tange Jan 16 '15 at 00:50
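For completeness, a rough sketch of the --pipepart variant described in the comment above; the exact options can differ between GNU parallel versions, so check the manual before relying on this:

$ parallel --pipepart -a moby-dick-words.txt --block 10M --files sort |
   parallel -Xj1 sort -m {} ';' rm {} > moby-dick-words-sorted-pipepart.txt

Because --pipepart reads parts of the physical file directly (it needs -a), the input is not funneled through a single cat process, which is where the performance win over split or --pipe comes from.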
  • @OleTange - There's lots of ways to get less terrible performance here, but it will always be much slower than just `sort --parallel` in multicore, as per the OP. The intention in my answer is to be explicit about what's going on to demonstrate the basic ideas. You may want to add another answer to describe various performance improvements, which are worth knowing about in their own right. – Jonathan Dursi Jan 16 '15 at 03:46