0

I want to implement a hashing technique in C where all the permutation of a string have same hash keys.
e.g. abc & cab both should have same keys.

I have thought of adding the ascii values & then checking frequency of characters[important otherwise both abc & aad would have same keys which we do not want].
But, it doesn't seem to be much efficient.

Is there any better hashing function which resolves collisions well & also doesn't result into sparse hash table?

Which hashing technique is used internally by Java [for strings] which not only minimizes the collisions but also the operations[insertion ,deletion, search] are fast enough?

Emil Vikström
  • 90,431
  • 16
  • 141
  • 175
Green goblin
  • 9,898
  • 13
  • 71
  • 100
  • Here's a [similar question](http://stackoverflow.com/questions/1536393/good-hash-function-for-permutations) that might enlighten you... – Eitan T Jun 24 '12 at 15:34

4 Answers4

12

Why not sort the string's characters before hashing?

David Kanarek
  • 12,611
  • 5
  • 45
  • 62
  • What if the string is several megabytes long? – Tony Ennis Jun 24 '12 at 14:42
  • 2
    Any processing is going to be at least O(n) time for this. Sorting should be possible in O(nlogn) time and O(n) space. Depending on the performance requirements, these increases may be too much, but they don't seem crazy to me. – David Kanarek Jun 24 '12 at 14:48
  • 1
    @Tony, David: Actually, a string can be sorted in O(n) time (counting sort). – Oliver Charlesworth Jun 24 '12 at 14:50
  • @OliCharlesworth well, watch me delete my answer after this! Thanks! – Tony Ennis Jun 24 '12 at 14:55
  • @OliCharlesworth, Since it's ASCII, you're right. But instead of doing the full counting sort, just count and hash the counters. – ugoren Jun 24 '12 at 14:59
  • @ugoren: Yes, that sounds quicker. – Oliver Charlesworth Jun 24 '12 at 15:00
  • I believe at this point you could just sum all the characters, and hash the sum. Is there a point to summing the counts into buckets first, which I surmise is that @OliCharlesworth 's 'counting sort' does? Summing the characters and isn't exactly the same as summing the counts of the characters, of course. That is a+b+c = 1+2+3 if we sum the characters. It's 1+1+1 if we sum the counts. – Tony Ennis Jun 24 '12 at 15:05
  • @TonyEnnis: Summing the characters make it easy to find collisions (e.g. `ad` and `bc`). – Oliver Charlesworth Jun 24 '12 at 15:06
  • By hashing the counters I believe he means to include their positions, not just straight counts. So aa is [2,0....] while bb is [0,2,....] – David Kanarek Jun 24 '12 at 15:07
  • @DavidKanarek So if the string is `abcdabc`, we get 1+2+3+4+1+2+3. By summing the counts we get [2,2,2,1]. What's the operation on the array? Multiplying the counts by successively larger prime numbers might be good, if expensive. A straight sum like I suggested may be more 'clumpy.' – Tony Ennis Jun 24 '12 at 15:17
  • @Tony I don't have an answer for the hashing algorithm. I'd just sort and hash the string with something common. I just wanted to clarify the information content of what was being hashed. I don't think any algorithm that treats "aa" and "b" as equal is going to be very effective, though I would be interested if anyone wants to try an experiment. – David Kanarek Jun 24 '12 at 15:20
  • @David Life is a trade-off :-) What we save in preventing collisions may be lost in time to compute the better key. Hard to say, without knowing the details. Fun topic! – Tony Ennis Jun 24 '12 at 15:31
4

The obvious technique is to simply sort the string. You could simply use the sorted string as the lookup key, or you can hash it with any algorithm deemed appropriate. Or you could use a run-length encoded (RLE) representation of your string (so the RLE of banana would be a3bn2), and optionally hash that.

A lot depends on what you're going to do with the hashes, and how resistant they must be to collisions. A simple CRC (cylic redundancy checksum) might be adequate, or it might be that cryptographic checksums such as MD5 or SHA1 are not secure enough for you.

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
2

Which hashing technique is used internally by Java [for strings] which not only minimizes the collisions but also the operations[insertion ,deletion, search] are fast enough?

The basic "trick" used in Java for speed is caching of the hash value making it a member variable of a String and so you only compute it once. BUT this can only work in Java since strings are immutable.

Cratylus
  • 52,998
  • 69
  • 209
  • 339
1

The main rule about hashing is "Don't invent your own hashing algorithm. Ever.". You could just sort characters in string and apply standard hashing strategy.

Also read that if you are interested in hashing.

kamilla-violi
  • 121
  • 1
  • 6
  • 1
    For a security stand point that is absolutely true, but you won't find any secure hash algorithm witch produce the same hash for permutations. And that's what algogeek is looking for! – Sascha Jun 24 '12 at 14:55