297

I normally compress using tar zcvf and decompress using tar zxvf (using gzip due to habit).

I've recently gotten a quad-core CPU with hyperthreading, so I have 8 logical cores, and I notice that many of the cores are unused during compression/decompression.

Is there any way I can utilize the unused cores to make it faster?

BSMP
user1118764
  • The solution proposed by Xiong Chiamiov above works beautifully. I had just backed up my laptop with .tar.bz2 and it took 132 minutes using only one cpu thread. Then I compiled and installed tar from source: https://www.gnu.org/software/tar/ I included the options mentioned in the configure step: ./configure --with-gzip=pigz --with-bzip2=lbzip2 --with-lzip=plzip I ran the backup again and it took only 32 minutes. That's better than 4X improvement! I watched the system monitor and it kept all 4 cpus (8 threads) flatlined at 100% the whole time. THAT is the best solution. – Warren Severin Nov 13 '17 at 04:37

8 Answers

438

You can also use the tar flag "--use-compress-program=" to tell tar what compression program to use.

For example use:

tar -c --use-compress-program=pigz -f tar.file dir_to_zip 
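The same flag works for extraction; tar invokes the program with -d, which pigz understands (a sketch, assuming the archive created above):

tar -x --use-compress-program=pigz -f tar.file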
Jen
  • 32
    This is an awesome little nugget of knowledge and deserves more upvotes. I had no idea this option even existed and I've read the man page a few times over the years. – Randall Hunt Nov 13 '13 at 10:01
  • 2
    @ValerioSchiavoni: Not here, I get full load on all 4 cores (Ubuntu 15.04 'Vivid'). – bovender Sep 18 '15 at 10:14
  • 11
    I prefer `tar cf - dir_to_zip | pv | pigz > tar.file`. pv helps me estimate progress; you can skip it. But it's still easier to write and remember. – Offenso Jan 11 '17 at 17:26
  • @NathanS.Watson-Haigh Yes, you can. Just enclose the program name and arguments in quotes. `man tar` says so, as does [this](https://stackoverflow.com/a/51275570/3258851). – Marc.2377 Feb 01 '20 at 00:25
  • 16
    In 2020, `zstd` is the fastest tool to do this, with a noticeable speedup while both compressing and decompressing. Use `tar -c --use-compress-program=zstdmt -f tar.file dir_to_zip` to do so with multi-threading. – jadelord Feb 05 '20 at 12:42
  • Confirmed. On a 32-core Threadripper Pro, compressing the 3.6 GB LJSpeech dataset via zstd takes less than half the time of pigz: `real 0m3.219s <=> real 0m6.570s` – lumpidu Feb 20 '23 at 22:01
407

You can use pigz instead of gzip, which does gzip compression on multiple cores. Instead of using the -z option, you would pipe it through pigz:

tar cf - paths-to-archive | pigz > archive.tar.gz

By default, pigz uses the number of available cores, or eight if it could not query that. You can ask for more with -p n, e.g. -p 32. pigz has the same options as gzip, so you can request better compression with -9. E.g.

tar cf - paths-to-archive | pigz -9 -p 32 > archive.tar.gz
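Decompression can use the same pipe pattern in reverse, though the speedup is smaller since deflate decompression is largely serial (a sketch, assuming the archive created above):

pigz -dc archive.tar.gz | tar xf -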
evandrix
Mark Adler
  • 7
    How do you use pigz to decompress in the same fashion? Or does it only work for compression? – user788171 Feb 20 '13 at 12:43
  • 64
    pigz does use multiple cores for decompression, but only with limited improvement over a single core. The deflate format does not lend itself to parallel decompression. The decompression portion must be done serially. The other cores for pigz decompression are used for reading, writing, and calculating the CRC. When compressing on the other hand, pigz gets close to a factor of _n_ improvement with _n_ cores. – Mark Adler Feb 20 '13 at 16:18
  • 10
    The hyphen here is stdout (see [this page](http://unix.stackexchange.com/questions/41828/what-does-dash-at-the-end-of-a-command-mean)). – Garrett Mar 01 '14 at 07:26
  • 2
    So as far as I understand files generated by pigz are compatible with gzip right? Can I decompress a file with gzip which had been created with pigz? – slhsen Jul 02 '14 at 14:23
  • 4
    Yes. 100% compatible in both directions. – Mark Adler Jul 02 '14 at 21:29
  • 2
    pigz can use multiple cores for compression, but the tar operation is still using only one core. Is there a parallel tar? – CharlesL Apr 22 '15 at 20:45
  • 8
    There is effectively no CPU time spent tarring, so it wouldn't help much. The tar format is just a copy of the input file with header blocks in between files. – Mark Adler Apr 23 '15 at 05:23
  • 1
    I have submitted an edit to the answer to indicate that the default number of compression threads equals the number of online processors, as per the official docs, not the 8 cores originally specified in the answer. Thanks. – kasur Oct 21 '15 at 10:17
  • 1
    The edit seems to have been rejected by someone else, but I will make a similar edit. – Mark Adler Oct 21 '15 at 11:58
  • 2
    just drop the -f option of tar if you want stdout. ;-) – Lester Cheung Dec 09 '15 at 23:38
  • Beware that redirecting (`>`) will simply overwrite existing files unless you have `set -o noclobber` set. – jmiserez May 13 '16 at 10:10
  • 1
    Also worth noting that since pigz probably is going to be network-bound in most situations unless you make it work hard, increasing the block size can dramatically improve performance. By increasing its block size to 524288 (512MB), I'm seeing numbers as high as 80MB/s over 802.11ac wifi. I believe the transfer is still network-bound, so you may see better results over gigabit ethernet. I sometimes see insane 400MB/s spikes, but those are scary and odd, so I'm not sure what to make of them. – William T Froggard Mar 02 '18 at 08:34
  • 2
    @WilliamTFroggard The spikes may be due to the burstiness of the deflate algorithm. Uncompressed data is collected until a deflate block can be produced, at which time the block is rapidly generated and emitted. – Mark Adler Apr 17 '18 at 16:19
  • Wouldn't be more performatic to use `-l` instead of STDIN/STDOUT? – Andre Figueiredo Dec 24 '18 at 05:09
  • I wouldn't know, since "performatic" is not a word. – Mark Adler Dec 24 '18 at 07:07
  • This is actually faster than `tar -c --use-compress-program=pigz` – Arik May 21 '20 at 13:42
142

Common approach

There is an option for the tar program:

-I, --use-compress-program PROG
      filter through PROG (must accept -d)

You can use a multithreaded version of an archiver or compressor utility.

The most popular multithreaded archivers are pigz (instead of gzip) and pbzip2 (instead of bzip2). For instance:

$ tar -I pbzip2 -cf OUTPUT_FILE.tar.bz2 paths_to_archive
$ tar --use-compress-program=pigz -cf OUTPUT_FILE.tar.gz paths_to_archive

The archiver must accept -d. If your replacement utility doesn't have this parameter and/or you need to specify additional parameters, then use pipes (add parameters if necessary):

$ tar cf - paths_to_archive | pbzip2 > OUTPUT_FILE.tar.bz2
$ tar cf - paths_to_archive | pigz > OUTPUT_FILE.tar.gz

The input and output of the single-threaded and multithreaded versions are compatible. You can compress using the multithreaded version and decompress using the single-threaded version, and vice versa.
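For example, a sketch of extracting the archives created above with the same pipe approach:

$ pbzip2 -dc OUTPUT_FILE.tar.bz2 | tar xf -
$ pigz -dc OUTPUT_FILE.tar.gz | tar xf -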

p7zip

For p7zip compression you need a small shell script like the following:

#!/bin/sh
# Helper so tar can filter through 7za; tar passes -d when extracting.
case "$1" in
  -d) 7za -txz -si -so e;;   # extract: xz stream from stdin to stdout
   *) 7za -txz -si -so a .;; # add: compress stdin to an xz stream on stdout
esac 2>/dev/null

Save it as 7zhelper.sh and make it executable. Here is an example of usage:

$ tar -I 7zhelper.sh -cf OUTPUT_FILE.tar.7z paths_to_archive
$ tar -I 7zhelper.sh -xf OUTPUT_FILE.tar.7z

xz

Regarding multithreaded XZ support: if you are running version 5.2.0 or above of XZ Utils, you can utilize multiple cores for compression by setting -T or --threads to an appropriate value via the environment variable XZ_DEFAULTS (e.g. XZ_DEFAULTS="-T 0").
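For example, a minimal sketch (assuming XZ Utils 5.2.0 or later; tar's -J selects xz):

$ XZ_DEFAULTS="-T 0" tar -cJf OUTPUT_FILE.tar.xz paths_to_archive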

This is a fragment of the man page for the 5.1.0alpha version:

Multithreaded compression and decompression are not implemented yet, so this option has no effect for now.

However, this will not help decompression of files that weren't compressed with threading enabled. From the man page for version 5.2.2:

Threaded decompression hasn't been implemented yet. It will only work on files that contain multiple blocks with size information in block headers. All files compressed in multi-threaded mode meet this condition, but files compressed in single-threaded mode don't even if --block-size=size is used.

Recompiling with replacement

If you build tar from source, you can recompile with these parameters:

--with-gzip=pigz
--with-bzip2=lbzip2
--with-lzip=plzip
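A sketch of the full build sequence with these options (the same configure line is quoted in a comment on the question above):

$ ./configure --with-gzip=pigz --with-bzip2=lbzip2 --with-lzip=plzip
$ make && sudo make install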

After recompiling tar with these options you can check the output of tar's help:

$ tar --help | grep "lbzip2\|plzip\|pigz"
  -j, --bzip2                filter the archive through lbzip2
      --lzip                 filter the archive through plzip
  -z, --gzip, --gunzip, --ungzip   filter the archive through pigz
antonagestam
Maxim Suslov
  • 1
    This is indeed the best answer. I'll definitely rebuild my tar! –  Apr 28 '15 at 20:41
  • 1
    I just found [pbzip2](http://compression.ca/pbzip2/) and [mpibzip2](http://compression.ca/mpibzip2/). mpibzip2 looks very promising for clusters or if you have a laptop and a multicore desktop computer for instance. –  Apr 28 '15 at 20:57
  • This is a great and elaborate answer. It may be good to mention that multithreaded compression (e.g. with `pigz`) is only enabled when it reads from the file. Processing STDIN may in fact be slower. – oᴉɹǝɥɔ Jun 10 '15 at 17:39
  • 3
    Plus 1 for the `xz` option. It's the simplest, yet effective, approach. – selurvedu May 26 '16 at 22:13
  • 4
    `export XZ_DEFAULTS="-T 0"` before calling `tar` with option `-J` for xz compression works like a charm. – scai Dec 21 '18 at 15:24
  • This answer looks like it was largely lifted directly from my [LQ post](https://www.linuxquestions.org/questions/linux-software-2/utilizing-multi-core-for-tar-gzip-bzip-compression-decompression-4175426075/#post5040938). A link back might have been nice. – ruario Feb 11 '21 at 06:31
14

You can use the shortcut -I for tar's --use-compress-program switch, and invoke pbzip2 for bzip2 compression on multiple cores:

tar -I pbzip2 -cf OUTPUT_FILE.tar.bz2 DIRECTORY_TO_COMPRESS/
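Extraction works with the same switch, since tar invokes the program with -d when reading (a sketch for the archive created above):

tar -I pbzip2 -xf OUTPUT_FILE.tar.bz2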
einpoklum
panticz
  • A nice TL;DR for @MaximSuslov's [answer](http://stackoverflow.com/a/27541309/1593077). – einpoklum Feb 11 '17 at 15:59
  • This returns `tar: home/cc/ziptest: Cannot stat: No such file or directory tar: Exiting with failure status due to previous errors` – Arash Mar 25 '20 at 00:48
2

If you want to have more flexibility with filenames and compression options, you can use:

find /my/path/ -type f \( -name "*.sql" -o -name "*.log" \) -exec \
tar -P --transform='s@/my/path/@@g' -cf - {} + | \
pigz -9 -p 4 > myarchive.tar.gz

Step 1: find

find /my/path/ -type f \( -name "*.sql" -o -name "*.log" \) -exec

This command will look for the files you want to archive, in this case /my/path/*.sql and /my/path/*.log. The parentheses group the -name tests so that -exec applies to both patterns; add as many -o -name "pattern" clauses inside them as you want.

-exec will execute the next command using the results of find: tar

Step 2: tar

tar -P --transform='s@/my/path/@@g' -cf - {} +

--transform is a simple string-replacement parameter. It strips the path of the files from the archive, so the tarball's root becomes the current directory when extracting. Note that you can't use the -C option to change directory, as you'd lose the benefit of find: all files of the directory would be included.

-P tells tar to use absolute paths, so it doesn't trigger the warning "Removing leading `/' from member names". The leading '/' will be removed by --transform anyway.

-cf - tells tar to create the archive and write it to stdout; the archive name is supplied later, at the final redirection.

{} + passes every file that find found to a single tar invocation.

Step 3: pigz

pigz -9 -p 4

Use as many parameters as you want. In this case -9 is the compression level and -p 4 is the number of cores dedicated to compression. If you run this on a heavily loaded web server, you probably don't want to use all available cores.

Step 4: archive name

> myarchive.tar.gz

Finally.
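To extract such an archive later, the reverse pipe is enough (a sketch using the archive name above):

pigz -dc myarchive.tar.gz | tar -xf -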

Bloops
2

A relatively newer (de)compression tool you might want to consider is Zstandard (zstd). It does an excellent job of utilizing spare cores, and it makes some great trade-offs between compression ratio and (de)compression time. It is also highly tweakable depending on your compression-ratio needs.
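For example, a minimal sketch with tar (assuming zstd is installed; -T0 uses all available cores, and recent GNU tar accepts a quoted program-plus-arguments string for -I):

tar -I 'zstd -T0' -cf archive.tar.zst paths-to-archive
tar -I zstd -xf archive.tar.zst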

pgebhard
2

You can speed up decompression by using a multi-threaded gzip decoder like rapidgzip. You can use it with tar like this:

python3 -m pip install --user rapidgzip
tar -x --use-compress-program=rapidgzip -f archive.tar

With a Ryzen 3900X 12-core processor, it can easily achieve a 12x speedup for simple gzip decompression, not accounting for GNU tar. These are the results for a 4 GiB file (compressed size: 3.1 GiB):

Decoder           Runtime / s   Bandwidth / (MB/s)
rapidgzip -P 24         1.320                 3254
rapidgzip -P 1          8.811                  487
igzip -T 24             9.295                  462
igzip                   9.225                  466
bgzip -@ 24            15.962                  269
bgzip                  16.202                  265
pigz                   13.391                  321
gzip                   22.218                  193

igzip is a good alternative as well. It can be installed with: sudo apt install isal. Just like pigz, it cannot parallelize decompression of arbitrary gzip files, but, just like bgzip and pigz, it can parallelize compression with the --threads option.

A second alternative would be bgzip, which can be installed with: sudo apt install tabix. Although bgzip cannot parallelize decompression of arbitrary gzip files, it can parallelize decompression for files compressed with bgzip, see e.g. these benchmarks for the same file as above but compressed with bgzip:

Decoder           Runtime / s   Bandwidth / (MB/s)
rapidgzip -P 24         1.125                 3818
rapidgzip -P 1          7.520                  571
igzip -T 24             7.377                  582
igzip                   7.321                  587
bgzip -@ 24             1.949                 2204
bgzip                  10.621                  404
pigz                   18.466                  233
gzip                   21.346                  201

The code for the benchmarks can be found here.

mxmlnkn
  • I tried to use rapidgzip for compression, but not work. Decompression works fine. Any reason why? – StayFoolish Aug 28 '23 at 10:09
  • @StayFoolish It wasn't intended for compression. Parallel compression of gzip is a solved problem. You can use pigz or bgzip for that. However, parallel decompression of any gzip file only works with rapidgzip. I understand the desire to use rapidgzip as a drop-in-replacement for gzip, so I might add compression in the future. – mxmlnkn Aug 28 '23 at 12:48
1

Here is an example of tar with the modern zstd compressor, as finding good examples for this one was difficult. It does the following:

  • Recurses into directories (standalone zstd cannot do this)
  • Includes the apt line to install the zstd and pv utilities on Ubuntu
  • Compresses multiple files and folders (the zstd command alone can only do single files)
  • Displays progress using pv - shows the total bytes compressed and the compression speed in GB/sec in real time
  • Uses all available cores with -T0
  • Sets the compression level higher than the default with -8
  • Displays the resulting wall-clock and CPU time after the operation is finished, using time
apt install zstd pv
DATA_DIR=/path/to/my/folder/to/compress
TARGET=/path/to/my/arcive.tar.zst

time (cd "$DATA_DIR" && tar -cf - * | pv | zstd -T0 -8 -o "$TARGET")
Mikko Ohtamaa