2

I am trying to create a hash for a string of 10 or 16 bytes. These strings are either ethernet mac address + ip address (6 + 4 bytes) or just a ipv6 ip (16 bytes).

I would ideally like to keep my cake and eat it. Absolute minimum collisions would be the priority. Hash must be less than 16 bytes long and be fairly quick. < O(n^2)

Any idea to where I should start?

Matt
  • 873
  • 1
  • 9
  • 24
  • 1
    What are the purpose and priorities of the hash? Rapid insertion? Rapid searching? Removal of duplicates? – Dancrumb Jun 06 '11 at 02:47
  • Rapid insertion i guess, using them to build a map that points to information on a host. I am reading up on the 'CHD Algorithm'. Sounds promising so far. – Matt Jun 06 '11 at 02:53
  • big-O is pretty meaningless for a fixed input size. – Nick Johnson Jun 06 '11 at 03:21
  • @Matthew Why not just use an existing hashtable implementation? Why do you need a custom hash? – Nick Johnson Jun 06 '11 at 03:22
  • @Nick, I didn't tell you which language I was using. I didn't know there are an existing hashtable implementation in c++. Do I have to go to boost? Is a vector or map just a good? AFAIK a map will find at O(logn) and uses a BST; so I would prefer to use a hashtable. – Matt Jun 06 '11 at 04:20
  • @Matthew Unless you have a compelling reason to invent your own, use an existing library. In practice you're unlikely to notice the difference between the BST and a hashtable; your own implementation will almost certainly be slower than the optimized implementation in the C++ STL. – Nick Johnson Jun 06 '11 at 05:09
  • @Nick, so just use a standard map? I will give it a shot. – Matt Jun 06 '11 at 05:17

2 Answers2

7

Perhaps I'm missing something, but if your data to hash is no longer than your hash, the obvious candidate is to use the data itself as hash - padded with,say, zeroes if shorter than 16 bytes: I doubt anything could beat that in terms of simplicity or collisions.

leonbloy
  • 73,180
  • 20
  • 142
  • 190
  • I think you are right. I think I will have to be happy with spending 16 bytes when recording and transmitting this host id. – Matt Jun 06 '11 at 02:57
  • 1
    Come to think of it, this could make a nice (if slightly mischevous) exam question... :-) – leonbloy Jun 06 '11 at 03:08
3

This question is sort-of a duplicate of Fast String Hashing Algorithm with low collision rates with 32 bit integer . If speed is your concern, start with the references there. (MurmurHash seems to be the consensus choice.)

If performance is not a big concern, just grab a SHA-1 library and use the first 16 of the 20 bytes of output. This is trivial to code using the library, and as resistant as you could want against collisions.

[edit]

As John Flatness points out in a comment, MD5 would also fit the bill. My suspicion is that it will be easier to find a SHA-1 libraries these days than MD5 (since MD5 was cracked several years ago), but whichever you have handy would be fine since this is not a cryptographic application.

Community
  • 1
  • 1
Nemo
  • 70,042
  • 10
  • 116
  • 153
  • If you're going to chop off the end of a sha1, why not use an [algorithm that already conveniently produces 16-byte hashes](http://en.wikipedia.org/wiki/MD5)? – John Flatness Jun 06 '11 at 02:54
  • MD4 may be a better idea. It is thoroughly broken, cryptographically speaking, but it is fast (faster than MD5, reputedly faster than CRC32 on some architectures) and easy to implement with compact code. See RFC 1320: http://tools.ietf.org/html/rfc1320 – Thomas Pornin Jun 07 '11 at 13:36