
I have a very long string that I need to compare for equality. Since comparing the strings char by char is very time consuming, I'd like to create a hash for each string.

I'd like the generated hash code to be unique (or the chance that two strings generate the same hash to be very small). I think creating an int from a string as a hash is not strong enough to rule out two different strings having the same hash code, so I am looking for a string hash code.

Am I right in the above assumption?

To clarify, assume that I have a string of, say, 1K characters and I create a hash code of 10 characters; then comparing the hash codes is about 100 times faster than comparing the strings.

The question I have is how to create such a hash code in C++.

I am developing on Windows using Visual Studio 2012.

mans
  • Try using the standard library: std::hash, http://en.cppreference.com/w/cpp/utility/hash – user1837009 Aug 20 '13 at 11:12
  • A similar question http://stackoverflow.com/questions/8094790/how-to-get-hash-code-of-a-string-in-c – MARK Aug 20 '13 at 11:13
  • Just to be sure -- you want to compare one string with others multiple times, right? Otherwise, you won't get any speedup. Hashing a string is linear in string length for non-trivial hash functions, just as comparison is. – arne Aug 20 '13 at 11:38
  • @arne It's worse than that. Hashing requires looking at every character in the string; with comparison, you stop at the first characters that aren't equal. – James Kanze Aug 20 '13 at 12:05
  • @JamesKanze: In practice, you're right, but from a computational complexity point-of-view, the worst case complexity of comparison is still linear with string length, i.e. if the last character differs. – arne Aug 20 '13 at 12:09
  • @arne From the complexity point-of-view, yes. They're both O(n). But the constant factor will probably be around an order of magnitude less for the comparison. (If we know something about the contents of the strings, both the hash and the comparison can be made faster.) – James Kanze Aug 20 '13 at 12:13
  • Is the hash for each string something that will be used many more times than it is calculated? Is there any characteristic of the strings that would make them more likely to match or partially match than uniformly random letters of a fixed length? – sh1 Aug 20 '13 at 12:44

5 Answers

4

To be useful in this case, the hash code must be quick to calculate. Using anything larger than the largest words supported by the hardware (usually 64 bits) may be counterproductive. Still, you can give it a try. I've found the following to work fairly well:

#include <string>

unsigned long long
hash( std::string const& s )
{
    unsigned long long results = 12345; //  anything but 0 is probably OK.
    for ( auto current = s.begin(); current != s.end(); ++ current ) {
        //  Multiply by a small prime and mix in the next character.
        results = 127 * results + static_cast<unsigned char>( *current );
    }
    return results;
}

Using a hash like this will probably not be advantageous, however, unless most of the comparisons are with strings that aren't equal, but have long common initial sequences. Remember that if the hashes are equal, you still have to compare the strings, and that comparison only has to go until the first characters which aren't equal. (In fact, most of the comparison functions I've seen start by comparing length, and only compare characters if the strings are of equal length.)
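A minimal sketch of that usage, assuming the hash() function above and a hypothetical HashedString wrapper (the names here are illustrative): the length and hash checks act as cheap pre-filters, and the full character comparison runs only when both match.

#include <string>
#include <utility>

//  Hypothetical wrapper that computes the hash once, up front.
struct HashedString
{
    std::string text;
    unsigned long long hashValue;

    explicit HashedString( std::string s )
        : text( std::move( s ) )
        , hashValue( hash( text ) )     //  hash() as defined above
    {
    }
};

bool equals( HashedString const& a, HashedString const& b )
{
    return a.text.size() == b.text.size()   //  cheapest check first
        && a.hashValue == b.hashValue       //  filters most non-matches
        && a.text == b.text;                //  still required on a hash match
}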

James Kanze
1

There are a lot of existing hashing algorithms which you may use.

If you want to implement one yourself, a simple one could be to take the ASCII value of each character, align it so that a = 1, b = 2, and so on, and multiply it by the character's (1-based) index in the string. Keep adding these values and store the sum as the hash value for that particular string.

For example, the hash value for abc would be:

HASH("abc") = 1*1 + 2*2 + 3*3 = 14; 

The probability of collision lowers as the string length increases (considering that your strings will be lengthy).
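A short sketch of this scheme, assuming input restricted to lowercase letters as in the example above (the function name is illustrative):

#include <cstddef>
#include <string>

//  Index-weighted hash as described above; assumes lowercase input,
//  so that 'a' maps to 1, 'b' to 2, and so on.
unsigned long long indexWeightedHash( std::string const& s )
{
    unsigned long long h = 0;
    for ( std::size_t i = 0; i != s.size(); ++i ) {
        h += ( i + 1 ) * static_cast<unsigned long long>( s[i] - 'a' + 1 );
    }
    return h;
}

//  indexWeightedHash( "abc" ) == 1*1 + 2*2 + 3*3 == 14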

Saksham
  • Have you done any measurements to show that this hashing is effective? I rather doubt that it would be (but I've not actually tried it either). – James Kanze Aug 20 '13 at 12:06
  • No, I didn't try it either. I agree that there will be collisions, but I suppose that even your method will have some. – Saksham Aug 20 '13 at 13:04
  • There will always be some collisions; the goal is to minimize them. Your suggestion will lead to a lot of collisions. – James Kanze Aug 20 '13 at 17:08
  • @JamesKanze I do not think so, as each letter value is being multiplied by its index as well. Also, as the OP said, the string length is very long, so there may be very few collisions. – Saksham Aug 20 '13 at 17:16
  • What you think isn't really relevant. Multiplying each letter by its index will end up causing some letters to be excluded from the hash, so it is, a priori, a bad hashing algorithm. (For short strings, it's particularly bad, but that isn't a problem here.) – James Kanze Aug 20 '13 at 17:19
  • @JamesKanze It seems you haven't understood my algorithm; let's agree to disagree. – Saksham Aug 20 '13 at 17:37
  • I have understood it; that's why I can safely say that it isn't a particularly good algorithm for hashing. – James Kanze Aug 20 '13 at 17:50
  • @JamesKanze alright. thanks for your input. really appreciate it – Saksham Aug 21 '13 at 02:44
0

There are many known hash algorithms available. For example, MD5, SHA-1, etc. You should not need to implement your own algorithm; use one of the available ones. Use the search engine of your choice to find implementations like this one.
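The linked implementation is not reproduced here, but as a rough illustration of calling a ready-made hash function rather than writing your own, here is a sketch using std::hash from the standard library (also suggested in the comments on the question):

#include <functional>
#include <iostream>
#include <string>

int main()
{
    std::hash<std::string> hasher;

    std::string a = "first long string...";
    std::string b = "second long string...";

    //  Unequal hashes prove the strings differ; equal hashes do not
    //  prove they match, so fall back to a full comparison.
    if ( hasher( a ) != hasher( b ) ) {
        std::cout << "definitely different\n";
    } else if ( a == b ) {
        std::cout << "equal\n";
    } else {
        std::cout << "hash collision\n";
    }
}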

Martin Höller
  • Crypto hashes like MD5 are much more expensive to calculate than a simple straight forward hash, and are probably not much better for this particular use. – James Kanze Aug 20 '13 at 12:07
0

It really depends what your hard requirements are. If you have hard requirements such as "search may never take more than so and so long", then it's possible that no solution is applicable. If your intent is simply to speed up a large number of searches, then a simple, short hash will do fine.

While it is generally true that hashing a 1000-character string to an integer (a single 32-bit or 64-bit number) can, and eventually will, produce collisions, this is not something to be concerned about.
A 10-character hash will also produce collisions. This is a necessary consequence of the fact that 1000 > 10: by the pigeonhole principle, every possible 10-character hash value corresponds to an astronomical number of distinct 1000-character strings1.

The important question is whether you will actually see collisions, how often you will see them, and whether it matters at all. Whether (or how likely) you see a collision depends not on the length of the strings, but on the number of distinct strings.
If you hash 77,100 strings (longer than 4 characters) using a 32-bit hash, you have a 50% chance of at least one collision occurring somewhere in the set. At 25,000 strings, the likelihood is only somewhere around 5-6%. At 1,000 strings, the likelihood is approximately 0.1%.
Note that when I say "50% at 77,100 strings", this does not mean that your chance of actually encountering a collision is that high. It's merely the chance of having two strings with the same hash value. Unless that's the case for the majority of strings, the chance of actually hitting one is again a lot lower.
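These figures follow from the standard birthday-problem approximation: hashing n strings into d possible hash values gives

p(at least one collision) ≈ 1 - exp( -n^2 / (2 * d) )

With d = 2^32 and n = 77,100, the exponent is about 0.69, which yields p ≈ 0.5.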

Which means no more and no less than that, for most usage cases, it simply doesn't matter. Unless you want to hash hundreds of thousands of strings, stop worrying now and use a 32-bit hash.
Otherwise, unless you want to hash billions of strings, stop worrying here and use a 64-bit hash.

The thing is, you must be prepared to handle collisions in any case, because as long as you have 2 strings, the likelihood of a collision is never exactly zero. Even hashing only 2 or 3 1000-character strings into a 500-byte hash could in principle produce a collision (very unlikely, but possible).
Which means you must do a string comparison if the hash matches in either case, no matter how long (or how good or bad) your hash is.

If collisions don't happen every time, they're entirely irrelevant. If you have a lot of collisions in your table and encounter one, say, on 1 in 10,000 lookups (which is a lot!), it has no practical impact. Yes, you will have to do a useless string comparison once in 10,000 lookups, but the other 9,999 work by comparing a single integer alone. Unless you have a hard realtime requirement, the measurable impact is exactly zero.
Even if you totally screw up and encounter a collision on every 5th search (a pretty disastrous case; it would mean that roughly 800 million string pairs collide, which is only possible at all with at least 1.6 billion strings), this still means that 4 out of 5 searches don't hit a collision, so you still discard 80% of non-matches without doing a comparison.

On the other hand, generating a 10-character hash is cumbersome and slow, and you are likely to create a hash function that has more collisions (because of bad design) than a readily available 32- or 64-bit hash.
Cryptographic hash functions are certainly better, but they run slower than their non-cryptographic counterparts too, and the storage required to store their 16 or 32 byte hash values is much larger, too (at virtually no benefit, to most people). This is a space/time tradeoff.

Personally, I would just use something like djb2, which can be implemented in 3 lines of C code, works well, and runs very fast. There exist of course many other hash functions that you could use, but I like djb2 for its simplicity.
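For reference, a sketch of djb2 in C++ (the classic Bernstein variant: seed 5381, multiplier 33):

#include <string>

//  djb2 (Dan Bernstein): seed 5381, multiplier 33.
unsigned long djb2( std::string const& s )
{
    unsigned long hash = 5381;
    for ( std::string::const_iterator it = s.begin(); it != s.end(); ++it ) {
        //  Equivalent to hash * 33 + c; traditionally written with a shift.
        hash = ( ( hash << 5 ) + hash ) + static_cast<unsigned char>( *it );
    }
    return hash;
}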

Funnily, after reading James Kanze's answer, the code posted there seems to be a variation of djb2, only with a different seed and a different multiplier (djb2 uses 5381 and 33, respectively).
In the same answer, the remark about comparing string lengths first is a good tip as well. It's noteworthy that you can consider a string's length a form of "hash function", too (albeit a rather weak one, but one that often comes "for free").


1However, strings are not some "random binary garbage" as the hash is. They are structured, low-entropy data. To that extent, the comparison does not really hold true.
Damon
0

Well, I would first compare string lengths. If they match, then I'd start comparing using an algorithm that tests character equality at random positions, stopping at the first difference. The random positions would be obtained from a stringLength-sized vector filled with a random permutation of the ints from 0 to stringLength-1. I haven't measured this method, though; it's just an idea. But it would save you the concerns of hash collisions while reducing comparison times.
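A sketch of that idea (the function name and the use of std::shuffle here are illustrative choices):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <string>
#include <vector>

//  Compare lengths first, then probe characters in a random order,
//  stopping at the first mismatch.
bool randomOrderEqual( std::string const& a, std::string const& b )
{
    if ( a.size() != b.size() ) {
        return false;
    }

    std::vector<std::size_t> order( a.size() );
    std::iota( order.begin(), order.end(), std::size_t( 0 ) );  //  0, 1, ..., size-1

    std::random_device rd;
    std::mt19937 rng( rd() );
    std::shuffle( order.begin(), order.end(), rng );

    for ( std::size_t i = 0; i != order.size(); ++i ) {
        if ( a[order[i]] != b[order[i]] ) {
            return false;
        }
    }
    return true;
}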

valir