bzip2 compresses data in blocks, and each block starts with the magic number 1AY&SY.

Can we determine the size of the uncompressed data behind each block?

One way is to decompress the bzip2 file block by block and measure the size of each decompressed block. However, I am looking for a way that does not involve decompression, so that I can learn the size of each uncompressed block at compression time.

The use case is that we need to tell the decompressing tool the maximum size of a decompressed block, so that it can allocate sufficient memory. Decompression will be done on an embedded platform, so resources are limited.

The bzip2 block header also does not contain any information about the size of the decompressed block. See the Wikipedia page for the bzip2 file format.

Note: I need a solution as C code, because I am using bzip2 in a console app written in C that runs on both Linux and Windows.

Zeeshan
  • Also see the Bzip manual and [Utility functions | BZ2_bzBuffToBuffDecompress](http://www.bzip.org/1.0.3/html/util-fns.html): *"Because the compression ratio of the compressed data cannot be known in advance, there is no easy way to guarantee that the output buffer will be big enough. You may of course make arrangements in your code to record the size of the uncompressed data, but such a mechanism is beyond the scope of this library..."* – jww Jun 29 '18 at 08:05

2 Answers

"The bzip2 block header also does not contain any information about the size of the decompressed block. See the Wikipedia page for the bzip2 file format."

The above statement answers your own question. You can't, because the size is not available before decompression. The header does not encode the uncompressed block size anywhere, as confirmed here:

http://www.forensicswiki.org/wiki/Bzip2

You must decompress each block in order to know its size.

Harry
  • Could we perhaps change the bzip2 source code to make this information available? Do you know which part of the code would need to be changed? – Zeeshan Apr 06 '16 at 08:55
  • Yes, you could change it. You should put together an example of what you've already tried and post it in your question. – Harry Apr 06 '16 at 16:37
  • I looked into the code but could not find a way; that's why I asked whether someone knows one. – Zeeshan Apr 07 '16 at 05:33
  • Are you prepared to fork bzip2? Anything you create will not work with any other bzip2 that expects the standard header. – Harry Apr 07 '16 at 05:39
  • I don't want to change the bzip2 header; I want to obtain this information and send it to the decompressor tool in some other way. – Zeeshan Apr 07 '16 at 05:42

The only block size information available is the size used in the encoding after the initial run-length encoding (RLE) has been done. So, as the article mentions, in the worst case you may get about 46 MB of decompressed data from one block, and all you know is that the output before RLE reversal is at most 900 kB.
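To show where the 46 MB figure comes from, here is a rough sketch in C. It assumes a block holds at most level × 100,000 bytes before RLE reversal, and that the most expansive RLE unit is four literal bytes plus one count byte (value 251), which decode to 255 bytes. The function name `bz2_max_block_output` is made up for this illustration and is not part of libbzip2.

```c
#include <assert.h>

/* Hypothetical helper: upper bound on the uncompressed size of one block.
 * level is the block-size digit from the stream header ('1'..'9' -> 1..9). */
unsigned long bz2_max_block_output(int level)
{
    assert(level >= 1 && level <= 9);

    /* Assumed maximum size of the block before RLE reversal. */
    unsigned long rle_bytes = (unsigned long)level * 100000UL;

    /* Every 5 stored bytes (4 literals + 1 count byte) can decode
     * to at most 255 output bytes. */
    return rle_bytes / 5UL * 255UL;
}
```

For level 9 this gives 45,900,000 bytes, so a decompressor that wants a guaranteed per-block buffer would have to reserve roughly that much, which is probably too pessimistic for an embedded target.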

So, in effect, the only way to do this is to decompress the file at least to the RLE stage and calculate the size based on that.
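As a sketch of that last step, the following C function computes the final size from the RLE-stage bytes without writing any output. It assumes you already have the block's data as it looks just before RLE reversal (getting that buffer out of the decoder is outside this sketch), and that the format is: after four identical bytes, the next byte is a count of extra repeats. The name `rle1_decoded_size` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: size of the data after reversing bzip2's initial
 * RLE stage, computed without producing the decoded bytes. */
size_t rle1_decoded_size(const unsigned char *buf, size_t n)
{
    assert(buf != NULL || n == 0);

    size_t out = 0;          /* decoded byte count so far */
    size_t i = 0;
    int run = 0;             /* length of the current run of equal bytes */
    unsigned char prev = 0;

    while (i < n) {
        unsigned char c = buf[i++];
        if (run > 0 && c == prev) {
            run++;
        } else {
            prev = c;
            run = 1;
        }
        out++;               /* every stored literal is one output byte */

        if (run == 4) {
            /* The byte after four equal literals is a repeat count. */
            if (i < n)
                out += buf[i++];
            run = 0;         /* the next byte starts a new run */
        }
    }
    return out;
}
```

For example, the six bytes `'a','a','a','a',3,'b'` decode to seven `a`s and one `b`, so the function returns 8. This still requires undoing the Huffman, MTF and BWT stages first, so it saves output memory rather than decoding work.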

Sami Kuhmonen