4

I'm using the adler32 checksum algorithm to generate a number from a database id. So, when I insert a row into the database, I take the identity of that row and use it to create the checksum. The problem that I'm running into is that I just generated a repeat checksum after only 207 inserts into the database. This is much much faster than I expected. Here is my code:

String dbIdStr = Long.toString(dbId);
byte[] bytes = dbIdStr.getBytes();
Checksum checksum = new Adler32();
checksum.update(bytes, 0, bytes.length);
result = checksum.getValue();

Is there something wrong with what/how I'm doing? Should I be using a different method to create unique strings? I'm doing this because I don't want to use the db id in a url... a change to the structure of the db will break all the links out there in the world.

Thanks!

Bhavik Ambani
  • 6,557
  • 14
  • 55
  • 86
threejeez
  • 2,314
  • 6
  • 30
  • 51
  • Your code looks correct so far. Maybe your "result" is a "byte" instead of a "long"? Or "dbId" isn't as unique as you think it is... Actually, I don't understand your problem at all. You need a unique ID to identify a database. And each of your databases already *has* a unique identifier named "dbId", but you don't want to use it, because "???". So instead you take that *very same dbId*, and convert it to a string, and hash it using Adler32, and then use *that* as your "unique" ID. But that hash is still based on the dbId that you didn't want to use! – Quuxplusone Jul 22 '12 at 04:49
  • Let's say I give you a link to a post in my system: example.com/posts/1. Then let's say I need to reorg things in my db (or perhaps move to a different type of storage altogether) that causes the db id to change. Now you have a broken link. That's why I'm generating these hashes. Also, I just checked a few online hashing tools and it seems that the two id's that were colliding on my system (126 and 207) are indeed producing the same adler32 result for those tools. Ex: http://www.fileformat.info/tool/hash.htm – threejeez Jul 22 '12 at 05:00
  • Just to add to my reasoning, it seems that a number of popular sites do this same thing. Ex: http://pinterest.com/pin/186758715768172705/. I doubt they've had over 186 quadrillion posts to their site! – threejeez Jul 22 '12 at 05:03
  • If the dbId changes, then its hash (or checksum) will *also* change. Hashing the dbId doesn't buy you anything. – Quuxplusone Jul 22 '12 at 07:27
  • I'm not rehashing the id's. There is another column in the db called externalId which is where the hashed result is stored. If the dbid column changes, the external will remain the same as it's not used by the db at all. – threejeez Jul 22 '12 at 07:38
  • 1
    You could just store the old IDs if you renumber. – Tom Anderson Jul 22 '12 at 22:48
  • @TomAnderson... true. I also wonder if there's any security risk to having my db id's out there. – threejeez Jul 23 '12 at 00:12

2 Answers2

13

You should not be using Adler-32 as a hash code generator. That's not what it's for. You should use an algorithm that has good hash properties, which, among other things minimizes the probability of collisions.

You can simply use Java's hashCode method (on any object). For the String object, the hash code is the sum of the byte values of string times successive powers of 31. There can be collisions with very short strings, but it's not a horrible algorithm. It's definitely a lot better than Adler-32 as a hash algorithm.

The suggestions to use a cryptographically secure hash function (like SHA-256) are certainly overkill for your application, both in terms of execution time and hash code size. You should try Java's hashCode and see how many collisions you get. If it seems much more frequent than you'd expect for a 2-n probability (where n is the number of bits in the hash code), then you can override it with a better one. You can find a link here for decent Java hash functions.

Mark Adler
  • 101,978
  • 13
  • 118
  • 158
  • Thanks for the tip, however I am using Java and CityHash is C++. – threejeez Jul 22 '12 at 07:40
  • 3
    @Miles: It is more than a tip, he appears to be *the* Adler who invented that checksum. – President James K. Polk Jul 22 '12 at 14:48
  • @GregS but I would opt for SHA-256 over `hashCode`. `hashCode` is also 32 bits and collisions fully depend on the implementation for a certain class. I would be surprised if it is any better than Adler32 actually - the many naive implementations of `hashCode` are certainly a drawback in that regard. – Maarten Bodewes Jul 22 '12 at 15:10
  • Darn, I'm not sure if I'm qualified enough to downvote Mark Adler :) Mark, could you remove that last sentence? – Maarten Bodewes Jul 22 '12 at 15:18
  • @owlstead: He doesn't appear to need a cryptographic hash function for his use case. – President James K. Polk Jul 22 '12 at 17:59
  • Understood, but keep in mind that hashCode has been designed to be overridden. Make sure that your object behaves like it should, and don't expect any kind of uniqueness of the result. Hashcode may also change during the lifetime of an implementation! – Maarten Bodewes Jul 22 '12 at 19:52
  • It's only overkill if there is not enough CPU or storage space to go around. Using a suboptimal hash for the problem seems a bigger issue to me than a bit of CPU time and a maximum of 28 bytes of data overhead. That said, if it is just a checksum (why would you need that in a database?) then go right ahead and use, ehm, CRC? – Maarten Bodewes Jul 22 '12 at 23:08
  • 2
    A CRC is also not optimized as a hash algorithm. It has some good properties, but there are much better choices. – Mark Adler Jul 22 '12 at 23:15
  • @GregS, I meant the tip for the web page that was C++. Clearly Adler32 is not the best choice for this application. I've got a lot to go on thanks to this thread. I will report back with my solution in time! – threejeez Jul 23 '12 at 00:08
  • I understated the number of bits needed in the hash. You should have more than twice as many bits as the log base 2 of the number of database entries. For example if you accept a 50% chance of a single collision, a 64-bit hash allows five billion database entries. A 32-bit hash isn't long enough, but a 64-bit hash covers quite large cases. – Mark Adler Jul 23 '12 at 00:46
  • 2
    A 64-bit number can be represented using 10 characters in base 85. – Mark Adler Jul 23 '12 at 01:10
  • So, I went with Java's hashCode. It works great, but I have two concerns: 1) since hashcode returns an int the hash code can turn out to be null, which is fine for hash codes, but not for urls (it's not pretty). To solve this, I'm taking the absolute value of the returned hash code. Is this ok? 2) If my object (and/or sub-objects used in my hashCode function) change, what impact does this have on collision? Can I expect collisions sooner? I guess the same question goes for 1 as well. – threejeez Jul 23 '12 at 20:13
  • 1
    I think you mean "can turn out to be negative" (not "null"). Sure, you can take the absolute value, and you basically just lose a bit. That would double the chance of individual collisions. Since you have to code the integer into text anyway, then I would just keep the negative values and fold the sign bit into your encoding. – Mark Adler Jul 23 '12 at 23:33
  • Yes, I meant "negative", not "null". What you said makes perfect sense. Thank you very much, Mark! – threejeez Jul 27 '12 at 00:49
0

Try and use a secure hash function like SHA-256. If you ever find a collision for any data that is not binary equal, you'll get $1000 on your bank account, with compliments. Offer ends if/when SHA-2 is cracked and you enter a collision deliberately. That said, the output is 32 bytes instead of 32 bits.

Maarten Bodewes
  • 90,524
  • 13
  • 150
  • 263
  • Unfortunately, SHA-256 produces strings that are way too long for a url (in terms of user friendliness). I'd need something somewhat shorter and preferably all numbers. – threejeez Jul 23 '12 at 00:06
  • OK, but with 32 bit and a large DB, you may run into a collision for about 2^16 elements (on average) because of the birthday paradox, and that's when you are using a perfect hash algorithm. Thats about 64Ki ID's. Of course, you may run into a collision a whole lot earlier if you are unlucky, or if you are using a suboptimal hash (but you already ran into that last one I guess). – Maarten Bodewes Jul 23 '12 at 00:26