59

I have a set of servers, each filled with a bunch of files that can be gzipped. The servers all have different numbers of cores. How can I write a bash script that launches one gzip per core while making sure no two gzips are compressing the same file?

User1
  • 39,458
  • 69
  • 187
  • 265
  • Are you sure that HDD speed will not limit them? – ruslik Dec 03 '10 at 00:39
  • 1
    @ruslik, exactly: HDD speed will be the bottleneck, or gzip would have added multi-processor support long ago. – Byron Whitlock Dec 03 '10 at 00:41
  • 13
    I disagree. In my experience, running gzip on a series of files pegs the CPU at 100% while disk I/O remains low. Yes, in a very extreme case you might see disk I/O become the next bottleneck, but that is an excellent reason to use those extra cores instead of running single-threaded. – Demosthenex Dec 03 '10 at 02:23
  • 1
    @Demosthenex is right. I thought the HDD would be the bottleneck too, but top is showing that the CPU is pegged. – User1 Dec 03 '10 at 15:09
  • 2
    @Demosthenex @User1 I stand corrected. Thank you for the education! – Byron Whitlock Dec 03 '10 at 18:18
  • If you had an insane number of powerful cores (i.e. 64!), you might be able to generate significant I/O relative to the CPU time, but it would be a very extreme case. – Demosthenex Dec 04 '10 at 04:44
  • @Demosthenex If you had an insane number of powerful cores, you would also have an insane SLC SSD array. For reference, my desktop class SSD array writes at up to 550MB/s (but usually only consistently writes 150-300MB/s) – Stephen Jul 10 '13 at 00:26

3 Answers

95

There is a multithreaded implementation of gzip: pigz. Since it compresses a single file across multiple threads, it should be able to read from disk more efficiently than compressing multiple files at once.
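A minimal sketch of usage (the paths are placeholders; -p sets the thread count and defaults to the number of online cores):

# Compress one file across all available cores:
pigz -9 /source/bigfile

# Or walk a whole tree; each file is still compressed multi-threaded:
find /source -type f -exec pigz -9 {} +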

David Yaw
  • 27,383
  • 4
  • 60
  • 93
  • 1
    I think that's a superior solution! Compressing each block in a separate thread beats using something like xargs to launch one process per file! On the other hand, if you can't install custom software on $X servers, you can fall back to the xargs behavior. Great find! – Demosthenex Dec 03 '10 at 03:14
  • 2
    This is great to know. Unfortunately, pigz is not on our servers. :( – User1 Dec 04 '10 at 00:15
  • 1
    Note: pigz can only do parallel compression, not decompression (more a limitation of the gzip format than of pigz, if I understand correctly). When decompressing, pigz does still use 4 threads, to separate reading, writing, and checking. – qwertzguy Nov 19 '15 at 14:49
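Given that limitation, decompressing many files can still be parallelized per file with the xargs approach from the next answer; a sketch, assuming GNU xargs and coreutils' nproc:

# One single-threaded gzip -d per file, as many at once as there are cores:
find /source -name '*.gz' -type f -print0 | xargs -0 -n 1 -P "$(nproc)" gzip -d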
69

If you are on Linux, you can use GNU's xargs to launch as many processes as you have cores.

CORES=$(grep -c '^processor' /proc/cpuinfo)
find /source -type f -print0 | xargs -0 -n 1 -P $CORES gzip -9
  • find -print0 / xargs -0 protects you from whitespace in filenames
  • xargs -n 1 means one gzip process per file
  • xargs -P specifies the number of parallel jobs
  • gzip -9 means maximum compression
Demosthenex
  • 4,343
  • 2
  • 26
  • 22
  • 13
    It's not necessary to export the variable. You should use `$()` instead of backticks. It's not necessary to use `cat` - `grep` accepts a file as an argument. GNU `grep` (if not others as well) can count, so you don't need `wc`. End result: `CORES=$(grep -c ^processor /proc/cpuinfo)` – Dennis Williamson Dec 03 '10 at 03:23
  • 1
    You're absolutely right. I was lazy catting around in proc looking for it, and left it cobbled together. That's much cleaner. – Demosthenex Dec 03 '10 at 04:20
  • If you want to reserve, let's say, 2 processors for other programs, you could use the following (there is probably a cleaner or more bash-ish way to do this; one is sketched after this comment thread): CORES=$(grep -c '^processor' /proc/cpuinfo | perl -ane 'print $F[0] - 2') – Morlock Sep 19 '12 at 15:09
  • That makes sense too. I'd think that if they want to reserve cores, they'd just specify a number by hand instead of trying to detect. – Demosthenex Sep 19 '12 at 20:50
  • 1
    BTW, find / xargs work on any Unix-like system (such as Mac OS X), not just on Linux. The only Linux-specific thing here is /proc/cpuinfo. If you set CORES manually (or find some other way of getting it), you can use this anywhere. – Paul Legato Feb 21 '13 at 23:44
  • Not every Unix uses the GNU implementations of find & xargs. AIX, for instance, has its own xargs which doesn't support -P. Solaris, HP-UX, and several other proprietary implementations may have the same issue. It won't work everywhere, but it should work anywhere the GNU utilities are installed and CORES is set manually. – Demosthenex Feb 22 '13 at 14:48
  • You can probably just use nproc in place of parsing the output of /proc/cpuinfo, e.g. find /source -type f -print0 | xargs -0 -n 1 -P $(nproc) gzip -9 – higginse Mar 27 '19 at 11:34
  • why don't you just use `$(nproc)` xD – france1 Nov 02 '22 at 11:58
    @france1 - Check the date. `nproc` was first included in coreutils 8.1 in 2009, and probably wasn't (yet) available in the OP's distribution. – Dave May 22 '23 at 06:03
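Pulling the portability and core-reservation comments above together, a sketch (assumptions: coreutils' nproc on Linux, sysctl -n hw.ncpu on macOS/BSD, and that at least one of the two exists):

# Portable core count: try nproc (coreutils >= 8.1), then sysctl, else 1.
CORES=$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)

# Reserve 2 cores for other programs using plain bash arithmetic; keep >= 1 job.
JOBS=$(( CORES > 2 ? CORES - 2 : 1 ))

find /source -type f -print0 | xargs -0 -n 1 -P "$JOBS" gzip -9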
7

You might want to consider checking out GNU parallel. I also found this video on YouTube, which seems to do what you are looking for.
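A minimal sketch with GNU parallel, which runs one job per core by default (/source is a placeholder; --null pairs with find -print0):

find /source -type f -print0 | parallel --null gzip -9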

Gangadhar
  • 1,893
  • 9
  • 9
  • 1
    Parallel mentions that it uses flags similar to those of xargs; ironically, I found out recently that xargs now includes the ability to launch multiple processes; see my answer. – Demosthenex Dec 03 '10 at 02:25