8

I have a large text file of ~8GB on which I need to do some simple filtering and then sort all the rows. I am on a 28-core machine with an SSD and 128GB of RAM. I have tried

Method 1

awk '...' myBigFile | sort --parallel=56 > myBigFile.sorted

Method 2

awk '...' myBigFile > myBigFile.tmp
sort --parallel=56 myBigFile.tmp > myBigFile.sorted

Surprisingly, Method 1 takes 11.5 min while Method 2 takes less than 2 min in total (0.75 min + 1 min for the two steps). Why is sorting so much slower when reading from a pipe? Is it not parallelized?

EDIT

awk and myBigFile are not important; this experiment is reproducible simply with seq 1 10000000 | sort --parallel=56 (thanks to @Sergei Kurenkov), and I also observed a six-fold speed improvement with the un-piped version on my machine.
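
A minimal reproduction sketch of the two variants (output goes to /dev/null just for timing; the temporary file name is a placeholder):

    # piped: sort reads from stdin
    seq 1 10000000 | sort --parallel=56 > /dev/null

    # un-piped: sort reads from a regular file
    seq 1 10000000 > seq.tmp
    sort --parallel=56 seq.tmp > /dev/null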

bbvan
  • Try adding -S8G to allocate an 8G buffer for the sort reading from the pipe and see if that helps. To measure consistently you might want to [drop caches](https://linux-mm.org/Drop_Caches) before each run (see the sketch after these comments), since e.g. myBigFile should be cached after reading it the first time, although that will not explain such a big difference. – JimD. Apr 12 '17 at 08:12
  • @JimD. This really helps! Now it is only slightly slower (134 seconds vs 105 seconds) than writing the intermediate file to the SSD (but shouldn't it be faster...?) – bbvan Apr 12 '17 at 09:18
  • When reading from a pipe, sort doesn't know how long the input is. When reading from a file it knows how big it is. Beyond this, the file is cached on your nice system with 128GB of memory, so the sort is not actually reading from disc, but from cache. So I think the difference is attributable to sort knowing the input is huge and effectively dividing the work efficiently, whereas the sort in the pipeline has to discover the input is huge (which it won't even do without the -S). – JimD. Apr 12 '17 at 09:23
  • Also, sort can seek on the file, but not on the pipe. You'd have to take a look at the implementation to see if it takes advantage of that. – JimD. Apr 12 '17 at 09:30
  • @JimD. Thanks for this detailed explanation. Could you kindly please make it an answer so I can accept it? – bbvan Apr 12 '17 at 10:02
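
For reference, the drop-caches step JimD. suggests can be done like this on Linux (a sketch; it needs root and flushes the whole page cache, so only do it on a test machine):

    # flush dirty pages to disk, then drop the page cache, dentries and inodes
    sync && echo 3 | sudo tee /proc/sys/vm/drop_caches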

3 Answers

7

When reading from a pipe, sort assumes that the file is small, and for small files parallelism isn't helpful. To get sort to utilize parallelism you need to tell it to allocate a large main memory buffer using -S. In this case the data file is about 8GB, so you can use -S8G. However, at least on your system with 128GB of main memory, method 2 may still be faster.
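
For example, Method 1 with an explicit buffer size might look like this (a sketch; the awk program is a placeholder, and 8G is chosen to roughly match the size of the input):

    # give sort a large buffer so it treats the piped input as a big, parallelizable job
    awk '...' myBigFile | sort -S8G --parallel=56 > myBigFile.sorted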

This is because sort in method 2 can know from the size of the file that it is huge, and it can seek in the file (neither of which is possible for a pipe). Further, since you have so much memory compared to these file sizes, the data for myBigFile.tmp need not be written to disc before awk exits, and sort will be able to read the file from cache rather than disc. So the principal difference between method 1 and method 2 (on a machine like yours with lots of memory) is that sort in method 2 knows the file is huge and can easily divide up the work (possibly using seek, but I haven't looked at the implementation), whereas in method 1 sort has to discover that the data is huge, and it cannot use any parallelism in reading the input since it cannot seek in a pipe.

JimD.
  • Do you need to specify both `-S` and `--parallel` or is `sort` smart enough to know to use the memory without the `--parallel` switch? – Hashim Aziz Dec 02 '20 at 05:54
  • In my experience -S alone will do it; when reading from a pipe, --parallel alone wasn't enough. See, e.g., https://superuser.com/questions/938558/sort-parallel-isnt-parallelizing – JimD. Dec 02 '20 at 19:35
  • I am a little confused by the statement "the data need not be written to disc before awk exits, and sort will be able to read the file from cache" doesn't the file need to be written to disc for this to work? – Cole Dec 10 '20 at 06:05
  • @Cole No, linux writes to the cache and marks the pages as dirty. Dirty pages are eventually flushed to disc, but this is not necessary for a file read to hit a cached dirty page. See, e.g., https://www.oreilly.com/library/view/understanding-the-linux/0596005652/ch15s01.html – JimD. Dec 10 '20 at 15:53
3

I think sort does not use threads when reading from a pipe.

  1. I used this command for your first case, and it shows that sort uses only 1 CPU even though it is told to use 4; atop also shows that there is only one thread in sort:

    /usr/bin/time -v bash -c "seq 1 1000000 | sort --parallel 4  > bf.txt"
    
  2. I used this command for your second case, and it shows that sort uses 2 CPUs; atop also shows that there are four threads in sort:

    /usr/bin/time -v bash -c "seq 1 1000000 > tmp.bf.txt && sort --parallel 4  tmp.bf.txt > bf.txt"
    

In your first scenario sort is an I/O-bound task: it does lots of read syscalls from stdin. In your second scenario sort uses mmap syscalls to read the file, which avoids making it an I/O-bound task.
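
If you want to confirm the thread count without atop, you can also check /proc while one of the commands above is running (a sketch; pgrep -x sort assumes there is only one sort process on the machine):

    # number of threads in the running sort process
    grep Threads /proc/$(pgrep -x sort)/status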

Below are results for the first and second scenarios:

$ /usr/bin/time -v bash -c "seq 1 10000000 | sort --parallel 4  > bf.txt"
        Command being timed: "bash -c seq 1 10000000 | sort --parallel 4  > bf.txt"
        User time (seconds): 35.85
        System time (seconds): 0.84
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:37.43
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 9320
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 2899
        Voluntary context switches: 1920
        Involuntary context switches: 1323
        Swaps: 0
        File system inputs: 0
        File system outputs: 459136
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

$ /usr/bin/time -v bash -c "seq 1 10000000 > tmp.bf.txt && sort --parallel 4  tmp.bf.txt > bf.txt"
        Command being timed: "bash -c seq 1 10000000 > tmp.bf.txt && sort --parallel 4  tmp.bf.txt > bf.txt"
        User time (seconds): 43.03
        System time (seconds): 0.85
        Percent of CPU this job got: 175%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.97
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1018004
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 2445
        Voluntary context switches: 299
        Involuntary context switches: 4387
        Swaps: 0
        File system inputs: 0
        File system outputs: 308160
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
2

You have more system calls if you use the pipe.

seq 1000000 | strace sort --parallel=56 2>&1 >/dev/null | grep read | wc -l
2059

Without the pipe the file is mapped into memory.

seq 1000000 > input
strace sort --parallel=56 input 2>&1 >/dev/null | grep read | wc -l
33
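
To see how much time those system calls cost, strace's -c summary mode can be used instead of grepping (a sketch; the per-syscall summary is printed to stderr when the command finishes):

    seq 1000000 | strace -c sort --parallel=56 > /dev/null
    strace -c sort --parallel=56 input > /dev/null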

Kernel calls are in most cases the bottleneck. That is the reason why sendfile was invented.

ceving