11

Say I have a bzip2 file (over 5GB), and I want to decompress only block #x, because that is where my data is (the block is different every time). How would I do this?

I thought about making an index of where all the blocks are, then cutting the block I need out of the file and applying bzip2recover to it.

I also thought about compressing, say, 1MB at a time, appending each piece to a file (and recording its location), and simply grabbing the piece I need when I need it, but I'd rather keep the original bzip2 file intact.

My preferred language is Ruby, but any language's solution is fine by me (as long as I understand the principle).

niton
user163365

2 Answers

7

There is a tool for this, seek-bzip2: http://bitbucket.org/james_taylor/seek-bzip2

Grab the source, compile it.

Run with

./seek-bzip2  32 < bzip_compressed.bz2 

to test.

The only parameter is the bit offset of the block header you want. I originally suggested finding it by searching the binary file for the hex string "31 41 59 26 53 59". THIS WAS INCORRECT. A block start may not be aligned to a byte boundary, so you have to search for every possible bit shift of the "31 41 59 26 53 59" hex string, as bzip2recover does - http://www.bzip.org/1.0.3/html/recovering.html

32 is the size in bits of the "BZh1" stream header, so it is the offset of the first block. The 1 can be any digit from "1" to "9" (in classic bzip2); it gives the (uncompressed) block size in hundreds of kB (not exact).
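
To illustrate the bit-shifted search described above, here is a minimal Python sketch. It is not taken from seek-bzip2 or bzip2recover, and the function name `find_block_bit_offsets` is made up for illustration. It slides a 48-bit window over the data one bit at a time and records every position where the window equals the block magic. Pure Python like this is far too slow for a 5GB file, so treat it as a demonstration of the principle only.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # 48-bit block header magic: 31 41 59 26 53 59
MASK48 = (1 << 48) - 1


def find_block_bit_offsets(data, magic=BLOCK_MAGIC):
    """Return the bit offset of every occurrence of `magic` in `data`.

    Hypothetical helper: it checks every bit position, not just byte
    boundaries, because a block may start at any bit.
    """
    offsets = []
    window = 0      # the last 48 bits seen, most significant bit first
    bits_read = 0
    for byte in data:
        for shift in range(7, -1, -1):
            # Shift the next bit into the window and keep only 48 bits.
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK48
            bits_read += 1
            if bits_read >= 48 and window == magic:
                offsets.append(bits_read - 48)
    return offsets


if __name__ == "__main__":
    compressed = bz2.compress(b"hello bzip2 " * 1000)
    # The first block starts right after the 32-bit "BZh9" stream header.
    print(find_block_bit_offsets(compressed))
```

The returned values are bit offsets, the same unit that seek-bzip2 takes as its parameter. In principle the 48-bit pattern could also turn up by chance inside compressed data, so a very rare false hit is possible.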

osgx
  • Note: a block start may not fall on a byte boundary :( There is a bzip-table program included in "seek-bzip2" that lists the bit offsets of the blocks and the sizes of the original data blocks. – osgx Sep 13 '10 at 16:35
  • Unfortunately, "bzip-table" runs at almost the same speed as actual decompression :(. It does almost a full decompression cycle, but doesn't check the CRC. – osgx Sep 14 '10 at 15:06
  • Also, take a look at parallel bzip2 implementations, such as pbzip2 by Jeff Gilchrist. For parallel decompression it needs to search for block headers. Code: http://www.google.com/codesearch/p?hl=en#calSvFpbfuI/trunk/trunk/demo/pbzip2-1.0.2/pbzip2.cpp&q=pbzip2&sa=N&cd=2&ct=rc&l=3 (the `producer_decompress` function) – osgx Sep 14 '10 at 21:15
2

It's true that bzip-table is almost as slow as decompressing, but of course you only have to run it once, and you can store the output in some fashion to use as an index. This is perfect for what I need but may not be what everybody needs.

I did need a little help getting it to compile on Windows though.
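
For anyone who wants to see the "store it once, use it as an index" idea without seek-bzip2, here is a rough Python sketch. Everything in it is an assumption for illustration: the index is just a JSON list of bit offsets (each block start, plus the end-of-stream marker as the last entry), and `save_index`, `load_index` and `extract_block` are made-up names. The extraction follows the same principle as bzip2recover: copy the block's bits into a fresh single-block stream, append the end-of-stream marker and the CRC, and decompress that.

```python
import bz2
import json

EOS_MAGIC = 0x177245385090  # 48-bit end-of-stream marker


def save_index(path, bit_offsets):
    """Store the list of bit offsets as JSON (assumed index format)."""
    with open(path, "w") as f:
        json.dump(bit_offsets, f)


def load_index(path):
    with open(path) as f:
        return json.load(f)


def extract_block(data, start_bit, end_bit):
    """Decompress the single block occupying bits [start_bit, end_bit)."""
    # Only the bytes that cover the block are needed; with a real file
    # this slice could be a seek() plus read() instead.
    first_byte = start_bit // 8
    last_byte = (end_bit + 7) // 8
    chunk = data[first_byte:last_byte]
    nbits = end_bit - start_bit
    # Drop the bits after end_bit, then mask off the bits before start_bit.
    tail = len(chunk) * 8 - (end_bit - first_byte * 8)
    block = (int.from_bytes(chunk, "big") >> tail) & ((1 << nbits) - 1)
    # A block is laid out as: 48-bit magic, 32-bit block CRC, payload.
    block_crc = (block >> (nbits - 80)) & 0xFFFFFFFF
    # New stream body: block + end marker + stream CRC. For a stream
    # holding one block, the stream CRC equals the block CRC.
    stream = (block << 80) | (EOS_MAGIC << 32) | block_crc
    stream_bits = nbits + 80
    pad = (-stream_bits) % 8  # zero bits needed to reach a byte boundary
    body = (stream << pad).to_bytes((stream_bits + pad) // 8, "big")
    # "BZh9" declares the largest block size, so any block fits.
    return bz2.decompress(b"BZh9" + body)
```

With an index loaded as `idx`, block number x would then be `extract_block(data, idx[x], idx[x + 1])`. For a 5GB file, `data` could be an `mmap` of the file rather than the whole file read into memory.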

hippietrail
  • http://sourceforge.net/projects/mingw/files/Automated%20MinGW%20Installer/mingw-get-inst/mingw-get-inst-20110316/mingw-get-inst-20110316.exe/download – osgx Mar 30 '11 at 01:07