
I have a problem rather close to Binary search in a sorted (memory-mapped?) file in Java.

I want to implement binary search for a string in a large file using Java MappedByteBuffers, but in my case the large file is compressed with bzip2. Let's say the file was compressed with the -1 option (100k blocks). (Actually I don't know the exact options, but I can repack the file.)

How should I search for strings in such a MappedByteBuffer? How do I decompress a single block? Is there a standard library for this, or do I have to read the header, the compressed data section, and the CRC myself? Are those blocks 100k in compressed form, or is 100k the uncompressed data length? And what does the last block look like?

Has anybody done binary search over a compressed file, maybe not in Java?

dkiselev
  • It's a bad idea anyway. I researched this as long ago as the 1970s. Binary search over what were then described as virtual arrays is extremely slow. A proper index structure performs many times as quickly. Adding compression to the mix can only make it far worse. – user207421 Jun 28 '14 at 00:32

1 Answer


You would need to read the file once to build an index of where each compressed block starts. Once you have this, you can do a binary search over those blocks. Note: if you have an underlying record or key, it could be split across multiple blocks.
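A minimal sketch of that approach, assuming independently compressed blocks of sorted lines. The JDK has no bzip2 codec, so Deflate from `java.util.zip` stands in for bzip2 here; the class and method names are illustrative, not a real API. Each probe of the binary search must decompress one block to read its first key, so a lookup costs O(log n) decompressions:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;
import static java.nio.charset.StandardCharsets.UTF_8;

public class BlockOffsetSearch {

    static byte[] deflate(byte[] raw) {
        Deflater d = new Deflater();
        d.setInput(raw);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!d.finished()) out.write(buf, 0, d.deflate(buf));
        d.end();
        return out.toByteArray();
    }

    static byte[] inflate(byte[] comp) {
        try {
            Inflater inf = new Inflater();
            inf.setInput(comp);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!inf.finished()) out.write(buf, 0, inf.inflate(buf));
            inf.end();
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        }
    }

    // Compress `linesPerBlock` sorted lines per block; record each block's
    // start offset (this is the one-time indexing pass the answer describes).
    static byte[] build(List<String> sortedLines, int linesPerBlock, List<Integer> offsets) {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        for (int i = 0; i < sortedLines.size(); i += linesPerBlock) {
            offsets.add(file.size());
            List<String> block = sortedLines.subList(i, Math.min(i + linesPerBlock, sortedLines.size()));
            byte[] comp = deflate(String.join("\n", block).getBytes(UTF_8));
            file.write(comp, 0, comp.length);
        }
        offsets.add(file.size()); // sentinel: end of last block
        return file.toByteArray();
    }

    static String[] readBlock(byte[] file, List<Integer> offsets, int i) {
        byte[] comp = Arrays.copyOfRange(file, offsets.get(i), offsets.get(i + 1));
        return new String(inflate(comp), UTF_8).split("\n");
    }

    // Binary search over blocks: find the last block whose first line <= key
    // (decompressing each probed block), then scan that single block.
    static boolean contains(byte[] file, List<Integer> offsets, String key) {
        int lo = 0, hi = offsets.size() - 2, best = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (readBlock(file, offsets, mid)[0].compareTo(key) <= 0) {
                best = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (best < 0) return false; // key sorts before the whole file
        for (String line : readBlock(file, offsets, best))
            if (line.equals(key)) return true;
        return false;
    }
}
```

Note the sketch sidesteps the split-key problem by never splitting a line across blocks, which is exactly the custom packaging discussed below.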

A better solution is to build the compressed file yourself. Write a known number of records to each block and compress the blocks individually. Additionally, you can write an index recording where each block starts along with the first key of that block. This lets you find the right block without decompressing any keys, and decompress only one block per search instead of log2(n) blocks.
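A sketch of that custom format, under the same assumptions as before (Deflate standing in for bzip2, illustrative names): the index entries carry each block's first key, so the binary search runs entirely on the in-memory index and only the final block is decompressed:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;
import static java.nio.charset.StandardCharsets.UTF_8;

public class BlockIndexSearch {
    static final int RECORDS_PER_BLOCK = 4;

    // Index entry: first key of the block plus its compressed extent.
    record BlockEntry(String firstKey, int offset, int length) {}

    static byte[] deflate(byte[] raw) {
        Deflater d = new Deflater();
        d.setInput(raw);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!d.finished()) out.write(buf, 0, d.deflate(buf));
        d.end();
        return out.toByteArray();
    }

    static byte[] inflate(byte[] comp) {
        try {
            Inflater inf = new Inflater();
            inf.setInput(comp);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!inf.finished()) out.write(buf, 0, inf.inflate(buf));
            inf.end();
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        }
    }

    // Write RECORDS_PER_BLOCK sorted records per block, compressing each
    // block individually, and fill `index` with (firstKey, offset, length).
    static byte[] build(List<String> sortedKeys, List<BlockEntry> index) {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        for (int i = 0; i < sortedKeys.size(); i += RECORDS_PER_BLOCK) {
            List<String> block = sortedKeys.subList(i, Math.min(i + RECORDS_PER_BLOCK, sortedKeys.size()));
            byte[] comp = deflate(String.join("\n", block).getBytes(UTF_8));
            index.add(new BlockEntry(block.get(0), file.size(), comp.length));
            file.write(comp, 0, comp.length);
        }
        return file.toByteArray();
    }

    // Binary search the index for the last block with firstKey <= key,
    // then decompress and scan only that one block.
    static boolean contains(byte[] file, List<BlockEntry> index, String key) {
        int lo = 0, hi = index.size() - 1, best = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (index.get(mid).firstKey().compareTo(key) <= 0) {
                best = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (best < 0) return false; // key sorts before the whole file
        BlockEntry e = index.get(best);
        byte[] raw = inflate(Arrays.copyOfRange(file, e.offset(), e.offset() + e.length()));
        for (String s : new String(raw, UTF_8).split("\n"))
            if (s.equals(key)) return true;
        return false;
    }
}
```

In a real file the index itself would be serialized, e.g. as a trailer at the end of the file, and read once at startup.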

Peter Lawrey
  • Heh, I'd forgotten that a key may lie on the boundary of two blocks. Looks like an uncompressed file or custom compression is the only option. Thanks Peter. – dkiselev Jun 27 '14 at 11:55
  • @user1904112 You could decompress two blocks at a time; the problem is that if you start reading at a random point in the file, can you scan until you find a key reliably? – Peter Lawrey Jun 27 '14 at 13:05
  • Yes, I can. But still, as I understand it, I would have to scan blocks sequentially because the compressed block lengths vary. (As I found, gzip/bzip2 don't add a block index by default.) Custom packaging is still the best option. – dkiselev Jun 27 '14 at 17:24