
Let's say you have these two sequences of strings:

abc cba bc

bc abc cba

I'm trying to create a mapping for such sequences (each sequence is itself a string) so that the two sequences above are mapped into the same bucket.

My initial thought is to add the results of a hashing function applied to each string separately; that way their order won't matter. If I applied the hashing function to the sequence string as a whole, then of course the hash result would be different.
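That proposal can be sketched as follows (`order_free_hash` is an illustrative name, and `std::hash` stands in for whichever string hash is eventually chosen; whether this is collision-safe enough is exactly what the question asks):

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Sketch: combine per-string hashes with addition, a commutative and
// associative operation, so the order of the strings cannot matter.
std::size_t order_free_hash(const std::string& a,
                            const std::string& b,
                            const std::string& c)
{
    std::hash<std::string> h; // stand-in for any string hash function
    return h(a) + h(b) + h(c);
}
```

Any permutation of the same three strings then yields the same value, e.g. `order_free_hash("abc", "cba", "bc") == order_free_hash("bc", "abc", "cba")`.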

However, I'm very new to the world of string hashing functions and have no idea whether this approach would be efficient.

On this website, http://www.partow.net/programming/hashfunctions/index.html, I found many different implementations of string hashing, but I'm not sure which one would be the "best" for my needs.

Some technical details: each string in a sequence has at most 25 characters, and each sequence contains at most 3 strings.

Questions

1. Would this approach of adding the results of a string hashing function applied to each string of the sequence work?

2. If yes, which string hashing function should I use to get a low number of collisions while remaining time efficient?

Thank you in advance

ksm001
  • Would it be useful to apply the hashing function to a sorted copy of the string sequence? – Roger Rowland Apr 01 '13 at 10:26
  • what is the size of the alphabet (ie. what character set will be used)? – didierc Apr 01 '13 at 10:26
  • You want them in the same bucket, but NOT to collide? Tall order. – WhozCraig Apr 01 '13 at 10:27
  • if you sort the sequence you don't even need hashing, just compare strings with the same rank. – didierc Apr 01 '13 at 10:29
  • roger_rowland, I thought about this, however sorting the sequence would be O(k log k), where k is the number of strings in the sequence, and even if I use hashing later on, I would have at least O(n) for the hash to be generated. I would like to avoid the extra O(k log k) cost if possible. didierc, the alphabet would be the English alphabet (capital letters included) – ksm001 Apr 01 '13 at 10:30
  • Sorting a sequence of three strings is hardly overkill. The fact there are at-most three, and only three is a major bonus for including a 3-element sort in your hash function. An unwound set of if-elses would work. – WhozCraig Apr 01 '13 at 10:32
  • WhozCraig, you are right, but I'm not sure what would happen if I had many sequences of three 25-character strings that differed only in their last letter. The sorting phase would take a lot of time to decide which string should go first in the final sequence and which second. There will be some overall extra cost if I have many sequences of strings, which I would like to avoid if possible. – ksm001 Apr 01 '13 at 10:36
  • for addition I suggest using XOR. – Karoly Horvath Apr 01 '13 at 10:54
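The unwound if-else sort of three elements mentioned in the comments could look like this (an illustrative sketch, not code from the thread; `sort3` is a hypothetical helper name):

```cpp
#include <string>
#include <utility>

// Sorts three strings in place with at most three comparisons and
// no loop, as suggested for the fixed sequence size of 3.
void sort3(std::string& a, std::string& b, std::string& c)
{
    if (b < a) std::swap(a, b);
    if (c < b) std::swap(b, c);
    if (b < a) std::swap(a, b); // now a <= b <= c
}
```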

3 Answers


Just a demonstration of the idea (the string copying is very inefficient). Complexity is O(N log N), where N is the size of the key (effectively O(1) if your keys have a constant length known at compile time); I don't think you can do better than that:

#include <boost/functional/hash.hpp>
#include <set>
#include <algorithm>

std::size_t make_hash(
  std::string const& a,
  std::string const& b,
  std::string const& c)
{
    std::string input[] = {a,b,c};
    std::sort(input, input + (sizeof(input)/sizeof(*input))); // canonical order: permutations hash identically
    return boost::hash_range(input, input + (sizeof(input)/sizeof(*input)));
}

#include <iostream>
// g++ -I.../boost_1_47_0 string_set_hash.cpp
int main()
{
    std::cout << make_hash("abc", "bcd", "def") << std::endl; // 46247451276990640
    std::cout << make_hash("bcd", "def", "abc") << std::endl; // 46247451276990640
}

A fragment of boost/functional/hash.hpp for reference:

template <class T>
inline void hash_combine(std::size_t& seed, T const& v)
{
    boost::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
}

template <class It>
inline std::size_t hash_range(It first, It last)
{
    std::size_t seed = 0;

    for(; first != last; ++first)
    {
        hash_combine(seed, *first);
    }

    return seed;
}
bobah
  • thank you for your suggestion; however, wouldn't implementing your own hash function in the way I described avoid the extra cost of sorting? Finding the hash of a string would be at least O(N), but taking into account that I apply a hash function at most three times, once per string of the sequence, that gives O(Ki) complexity where i is the i-th string of the sequence, so the overall performance would be O(K1 + K2 + ...) = O(N). – ksm001 Apr 01 '13 at 10:52
  • Why is this better than combining the individual string hashes using a symmetric operation like addition? – Mike Seymour Apr 01 '13 at 10:52
  • @MikeSeymour - if you show the proof that addition preserves uniform keys distribution I will be happy to delete my answer – bobah Apr 01 '13 at 11:02
  • @bobah: I'm not suggesting the answer is wrong; I'd just like to see a justification for the increased complexity. (I don't have time to prove it, but I'm pretty sure that exclusive-or would preserve the distribution; I'd use that rather than addition). – Mike Seymour Apr 01 '13 at 11:09
  • @MikeSeymour - I trust boost hash library writer as an expert in good hash functions and suggested the answer using the existing API of the boost::hash. I have added a note about complexity, if the key size is small and fixed then sorting is extra NlogN vs N for XOR-ing. – bobah Apr 01 '13 at 11:35
  • @ksm001 - you may well win on total time over a large data set from a better hash function even if you pay the extra sorting cost; do not drop it till you prove it's bad in an experiment – bobah Apr 01 '13 at 13:39

Whatever hashing function you pick, you want the operator used for the final combination of the individual hashes to be:

  • commutative
  • associative

The sum, the product, and the exclusive or come to mind as candidates for integral values. So yes, adding would work. You would still have collisions between unrelated sequences, which need to be resolved, so you would need a string comparison function; but permutations of the same set of strings would end up in the same bucket.

You could also reverse the order of operations: first add the strings together character-wise (e.g. adding "ab" and "cba" gives ('a' + 'c')('b' + 'b')('\0' + 'a'), with carry propagation for sum or product, so perhaps xor is an interesting candidate here), and then apply a hash function. You could even combine these two operations as you go (pseudocode follows):

int hash(string a, string b, string c) {
    int r = 0, k;
    int m = max(a.length(), max(b.length(), c.length()));
    for (int i = 0; i < m; i++) {
        k = (i < a.length() ? a[i] : 0) ^
            (i < b.length() ? b[i] : 0) ^
            (i < c.length() ? c[i] : 0);
        r = hash_step(r, k);
    }
    return r;
}

with hash_step the incremental hashing function (renamed here to avoid confusion with the outer hash). A simple modulo against a sufficiently large prime (i.e. larger than the expected size of the bucket array) should be alright for normal purposes.
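A runnable C++ rendering of that scheme (the multiply-by-a-prime step stands in for the incremental hashing function; that particular choice, like the function name, is illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Character-wise XOR across the three strings, fed column by column
// into a simple incremental hash (multiply by a small prime and add).
// XOR per column makes the result independent of the strings' order.
std::size_t xor_columns_hash(const std::string& a,
                             const std::string& b,
                             const std::string& c)
{
    std::size_t r = 0;
    std::size_t m = std::max({a.size(), b.size(), c.size()});
    for (std::size_t i = 0; i < m; ++i) {
        unsigned char k = (i < a.size() ? a[i] : 0) ^
                          (i < b.size() ? b[i] : 0) ^
                          (i < c.size() ? c[i] : 0);
        r = r * 131 + k; // incremental hash step
    }
    return r;
}
```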

A completely different (and better?) solution is to simply sort the sequence (3 entries means quasi-constant time), then build an ordered map whose comparison function treats each string as a "digit" of a 3-digit number. But this is out of the scope of the question.
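That alternative can be sketched like this (`Key` and `make_key` are illustrative names; `std::array`'s built-in lexicographic comparison plays the role of the "3-digit number" ordering):

```cpp
#include <algorithm>
#include <array>
#include <map>
#include <string>
#include <utility>

// The map key is the sorted triple, so all permutations of the same
// three strings land on the same entry - no hashing required.
using Key = std::array<std::string, 3>;

Key make_key(std::string a, std::string b, std::string c)
{
    Key k{{std::move(a), std::move(b), std::move(c)}};
    std::sort(k.begin(), k.end()); // canonical order
    return k;
}
```

A `std::map<Key, int>` then puts ("abc", "cba", "bc") and ("bc", "abc", "cba") in the same entry.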

didierc

I would hash each element individually.

Then sort those hashes. Sorting three size_t values is fast.

Then chain those hashes. Your library may have hash-chaining functions, or you can even use hash( a+b+c ) with overflow wrap.

Avoid xor: xor-ing two identical hash values gives zero, and the hashes of identical strings are identical, so a naive xor can lead to ( a,a,b ) and ( c,c,b ) having the same hash output, which sucks.
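A sketch of this recipe (the chaining step borrows Boost's `hash_combine` mixing constant shown earlier in the thread; that choice, like the function name, is illustrative):

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <functional>
#include <string>

// Hash each string, sort the three hashes, then chain them.
// Sorting the hashes (rather than the strings) makes the result
// order-independent while keeping the comparisons cheap.
std::size_t hash_sequence(const std::string& a,
                          const std::string& b,
                          const std::string& c)
{
    std::hash<std::string> h;
    std::array<std::size_t, 3> hs = {{h(a), h(b), h(c)}};
    std::sort(hs.begin(), hs.end()); // input order no longer matters
    std::size_t seed = 0;
    for (std::size_t v : hs)
        seed ^= v + 0x9e3779b9 + (seed << 6) + (seed >> 2); // chain
    return seed;
}
```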

Yakk - Adam Nevraumont