3

For a Bioinformatics project I need to compress a large amount of bitstrings (strings containing only 0's and 1's) in Ruby to smaller strings to reduce memory usage.

So ideally a string like "0001010010010010010001001" becomes something like "2a452c66". I first used MD5 hashes until I read something about possible collisions which I would like to avoid.

I have tried a lot of different combinations of unpack, to_i, to_s, etc, but can't seem to get the right combination.

The solution should:

  • Keep any leading 0's.
  • Be reversible.
  • Compress (obviously).
  • And the output should avoid strange characters. Preferably I would like to stay in the alphanumeric space. (a-zA-Z0-9).

Thanks!

jlmr
  • 165
  • 10
  • Just for the record MD5 isn't reversible (no cryptographic hashes are) and you're very unlikely to get accidental collisions. – George Powell Aug 14 '13 at 16:19
  • I was aware that md5 isn't reversible but was willing to live with that if I would have been unable to find a better solution. But since my search space is huge (2^112) I thought it better to avoid even a theoretical collision. Thanks for the info though! – jlmr Aug 14 '13 at 19:15

2 Answers2

4

Try:

FORMAT = '%0.*b'

bitmask = "0001010010010010010001001"
bitmask.to_i(2) # => 2696329
hexval = bitmask.to_i(2).to_s(16) # => "292489"
FORMAT % [bitmask.size, hexval.to_i(16)] # => "0001010010010010010001001"

What it's doing is:

  • to_i(2) converts the string from binary to its integer value just to show what's happening.
  • to_i(2).to_s(16) converts it to its hexadecimal representation as a string.
  • FORMAT contains a printf format string saying to convert the passed in value into its binary string representation (%b) with leading 0 bytes (%0b) for an unknown length (%0.*b) which it gets from the first parameter passed (bitmask.size).

Here's another example using a longer bitmask:

bitmask = "11011110101011011011111011101111"

hexval = bitmask.to_i(2).to_s(16) # => "deadbeef"
FORMAT % [bitmask.size, hexval.to_i(16)] # => "11011110101011011011111011101111"

And longer still:

bitmask = "1101111010101101101111101110111111111110111011011010110111011010"

hexval = bitmask.to_i(2).to_s(16) # => "deadbeeffeedadda"
FORMAT % [bitmask.size, hexval.to_i(16)] # => "1101111010101101101111101110111111111110111011011010110111011010"
the Tin Man
  • 158,662
  • 42
  • 215
  • 303
  • I now looked at the to_s documentation. Is there a disadvantage to using to_s(36) to compress even further? – jlmr Aug 14 '13 at 19:19
  • Probably not, it's just not going to return values that are easily understood by human eyes. That's the advantage of looking at hex, it's a direct representation of binary, if you know how to do it of course. – the Tin Man Aug 14 '13 at 21:03
3

Just an interesting observation: If you want to convert a base-2 string into a higher base (say base-n), the compression ratio is 1/log2(n). This means that if you convert to hex as the other answer suggests, you'll get compression of 25% of the original. Moving all the way up to base 64 (only 2 more symbols than pure alphanumeric) you'll jump to about 17% compression. It just depends where you want to sit on that tradeoff!

Alternatively, if you could get rid of the reversibility requirement, only preserving equality, MD5 would be fine. See How many random elements before MD5 produces collisions? spoiler: it's a lot. The collision "problems" you will read about are purposeful collisions; cryptographers using their knowledge of MD5 to find collisions. Accidental collisions are impossible for all practical purposes.

Update

In terms of actually implementing a base64 encode in Ruby, I don't know. I don't actually know Ruby. What I'd do if it's not natively supported is make an array of all the alphanumeric characters + 2 (so the array is 64 long) and then convert chunks of 6 binary digits into the corresponding character by using that 6 digit binary number as an index in the character array. If you wanted to use 62 (or any other non-power-of-two) then the algorithm would be different.

Community
  • 1
  • 1
George Powell
  • 9,141
  • 8
  • 35
  • 43
  • I read in the documentation that Ruby is only able to encode up to base36. Is it possible to go to base 64? Or 62 to stay in alphanumeric? – jlmr Aug 16 '13 at 13:15