10

I am using Python's bz2 module to generate (and compress) a large jsonl file (17 GB after bzip2 compression).

However, when I later try to decompress it with pbzip2, it only seems to use one CPU core, which is quite slow.

When I compress it with pbzip2 instead, decompression can leverage multiple cores. Is there a way to compress within Python in a pbzip2-compatible format?

import bz2, sys, traceback
from Queue import Empty
# ...
compressor = bz2.BZ2Compressor(9)
f = open(path, 'ab')  # append in binary mode; compressed output is bytes

try:
    while 1:
        m = queue.get(True, 1 * 60)
        f.write(compressor.compress(m + "\n"))
except Empty:
    pass
except Exception:
    traceback.print_exc()
finally:
    sys.stderr.write("flushing")
    f.write(compressor.flush())
    f.close()
worenga
  • From what I've read parallelizing disk I/O is a [bad idea](https://stackoverflow.com/a/1993707/3727854). That being said [this answer may be relevant](https://stackoverflow.com/a/42012661/3727854) to this question. – James Draper Sep 19 '17 at 18:32
  • @JamesDraper disk I/O won't be the limiting factor though ... bzip computations are slow – o11c Sep 19 '17 at 19:03
  • @worenga Note that jsonl is a dangerous format if you might ever use top-level numbers, since they are prone to truncation. Json-seq fixes that problem by mandating an error. – o11c Sep 19 '17 at 19:05
  • Is there a reason you're using `BZ2Compressor` rather than `BZ2File`? – o11c Sep 19 '17 at 19:26
  • @o11c BZ2File does not support mode 'a'. – worenga Sep 19 '17 at 20:55
  • @worenga Ooh, didn't realize you were stuck on python2. Still, when you pass an existing file object, that's irrelevant. – o11c Sep 19 '17 at 23:20

2 Answers

5

A pbzip2 stream is nothing more than the concatenation of multiple bzip2 streams.

An example using the shell:

bzip2 < /usr/share/dict/words > words_x_1.bz2
cat words_x_1.bz2{,,,,,,,,,} > words_x_10.bz2
time bzip2 -d < words_x_10.bz2 > /dev/null
time pbzip2 -d < words_x_10.bz2 > /dev/null

I've never used Python's bz2 module, but it should be easy to close and reopen a stream in 'a'ppend mode every so many bytes to get the same result. Note that if BZ2File is constructed from an existing file-like object, closing the BZ2File will not close the underlying stream (which is what you want here).

I haven't measured how many bytes are optimal for chunking, but I would guess every 1-20 megabytes; it definitely needs to be larger than the bzip2 block size (900 kB), though.
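A rough sketch of that idea in Python 2, adapted from the code in the question (`path` and `queue` are the question's own names, and the 10 MB chunk size is an arbitrary pick from the range above): finish the current BZ2Compressor once enough uncompressed data has gone in and start a fresh one, so the output file is just a concatenation of independent bzip2 streams that pbzip2 can decompress in parallel.

import bz2, sys, traceback
from Queue import Empty

CHUNK_BYTES = 10 * 1024 * 1024  # uncompressed bytes per bzip2 stream (a guess, see above)

f = open(path, 'ab')
compressor = bz2.BZ2Compressor(9)
fed = 0  # uncompressed bytes fed into the current stream

try:
    while 1:
        m = queue.get(True, 1 * 60)
        data = m + "\n"
        f.write(compressor.compress(data))
        fed += len(data)
        if fed >= CHUNK_BYTES:
            # Finish this bzip2 stream and start a fresh one; the file
            # becomes a plain concatenation of independent .bz2 streams.
            f.write(compressor.flush())
            compressor = bz2.BZ2Compressor(9)
            fed = 0
except Empty:
    pass
except Exception:
    traceback.print_exc()
finally:
    sys.stderr.write("flushing")
    f.write(compressor.flush())
    f.close()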

Note also that if you record the compressed and uncompressed offsets of each chunk, you can do fairly efficient random access. This is how the dictzip program works, though that is based on gzip.
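For illustration, a rough sketch of such an index, assuming the chunked writer above recorded one (uncompressed offset, compressed offset) pair per stream; this is a hypothetical in-memory layout, not dictzip's actual format. To read around a given uncompressed position, find the stream containing it, seek to its compressed offset, and decompress only that stream.

import bz2
from bisect import bisect_right

# Hypothetical index built while writing: one
# (uncompressed_offset, compressed_offset) pair per bzip2 stream,
# starting with (0, 0).

def read_chunk_at(path, index, uncompressed_pos):
    """Return the decompressed chunk containing uncompressed_pos."""
    starts = [u for u, _ in index]
    i = bisect_right(starts, uncompressed_pos) - 1
    _, c_start = index[i]
    c_end = index[i + 1][1] if i + 1 < len(index) else None
    with open(path, 'rb') as fh:
        fh.seek(c_start)
        raw = fh.read() if c_end is None else fh.read(c_end - c_start)
    # Each chunk is a self-contained bzip2 stream, so a plain decompress works.
    return bz2.decompress(raw)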

o11c
2

If you absolutely must use pbzip2 for decompression, this won't help you, but the alternative lbzip2 can perform multicore decompression of "normal" .bz2 files, such as those generated by Python's BZ2File or a traditional bzip2 command. This avoids the limitation of pbzip2 you're describing, where it can only achieve parallel decompression if the file was also compressed using pbzip2. See https://lbzip2.org/.

As a bonus, benchmarks suggest lbzip2 is substantially faster than pbzip2, both on decompression (by 30%) and compression (by 40%) while achieving slightly superior compression ratios. Further, its peak RAM usage is less than 50% of the RAM used by pbzip2. See https://vbtechsupport.com/1614/.

goodside