-4

For strings of 10-50 characters:

double hash(const std::string & str)
{
    double result = 0;
    int n=str.length();
    for(int i=0;i<n;i++)
    {
        result += (str[i] - '@')*pow(256.0,i);
    }
    return result;
}

can this be used in a production code?

  • increasing total throughput of hashing when used with std::hash by ILP
  • correctness/uniqueness
  • scalability

new version by comments:

double hash(const std::string & str)
{
    double result = 0;
    int n=str.length();

    // maybe using multiple adders to do concurrently multiple chars
    // since they are not dependent
    for(int i=0;i<n;i++)
    {
        result += lookupCharDoubleType[str[i]]*lookupPow[i];
    }
    return result;
}

another version by another comment:

double hash(const std::string & str)
{
    double result = 0;
    int n=str.length();

    for(int i=0;i<n;i++)
    {
        result = result * 256.0 + lookupCharDoubleType[str[i]];
    }
    return result;
}
huseyin tugrul buyukisik
  • 11,469
  • 4
  • 45
  • 97
  • 1
    why you do not use buildin one? – apple apple Sep 26 '18 at 13:03
  • 7
    You normally want a hash to be an integer. –  Sep 26 '18 at 13:03
  • 2
    What if it will be only used for uniqueness? Not for indexing or something? And if I'm trying some algorithms to learn? – huseyin tugrul buyukisik Sep 26 '18 at 13:04
  • 1
    There must be some reason that everyone uses prime number multipliers in hashes, don't you think? – NiVeR Sep 26 '18 at 13:04
  • 1
    Unless you have some sort of performance bottleneck just use `std::hash`. If you just want to check if two values are equal or not just use `operator ==` – NathanOliver Sep 26 '18 at 13:05
  • 4
    If you're going to use `pow`, don't -- there is no need to. Just use a table of precomputed powers of 256 and do a lookup using `i` as the index. – PaulMcKenzie Sep 26 '18 at 13:06
  • 2
    Also, 256 ^ 50 is going to be a huge number and wont be representable as a 64bit integer, which due to floating point inaccuracy, you should be using for the hash value. – NathanOliver Sep 26 '18 at 13:08
  • 2
    @NathanOliver can std::hash be used by instruction level parallelism with this function to increase total throughput of hashing using integer+fp pipelines together? – huseyin tugrul buyukisik Sep 26 '18 at 13:08
  • 3
    If you're going to use values, precompute them. In a strange way, you're defeating the purpose of the hash by using non-hashing techniques (like computing the same value over and over again instead of a table lookup). – PaulMcKenzie Sep 26 '18 at 13:10
  • well, the lookup may not be necessary since you can use `result = result*256+(str[i] - '@')` (to some degree) – apple apple Sep 26 '18 at 13:23
  • 1
    I think my answer shows that this question is *not* "primarily opinion-based" and can be answered objectively. We should reopen it. Given the purpose of a hash function and the OP's stated goal of determining uniqueness, we can say without doubt that the function in question is not a good choice. Also: this is an honest question that provides enough information to answer it, including code. I don't think it deserves all those downvotes. – Caleb Sep 26 '18 at 13:57
  • Compute an md5 / sha1 / sha256 and then sum the digest bytes into an `unsigned int` (letting it wrap)? – Paul Sanders Sep 26 '18 at 14:03
  • 1
    I don't understand the downvoting either. Comments were given as to how to improve the function in terms of time taken, and answers were given as to why it is not a good hash for uniqueness. – PaulMcKenzie Sep 26 '18 at 15:52
  • Simple rule of thumb. If you are not an expert in Cryptogrophy or Hashing don't write your own Crypt / Hash code. These are exceedingly specialized fields and getting it correct (or even close to good) is very difficult (and proving it is good is even harder). – Martin York Sep 26 '18 at 16:46
  • @MartinYork can we use genetic algorithm to produce one? Maybe by giving parameters as functions(such as modulo or multiplication)and variables (ordinary numbers) and expect it to minimize hash collisions and evolve those ordinary numbers automatically to magical numbers? – huseyin tugrul buyukisik Sep 26 '18 at 16:49
  • 1
    Why not find some already written algorithms for hashing. They will generally have an analysis on how good they are in terms of collisions and speed. You will trade usually trade one for the other. `std::hash` is your best option. – Martin York Sep 26 '18 at 16:51
  • @MartinYork thank you very much. – huseyin tugrul buyukisik Sep 26 '18 at 16:53

1 Answers1

4

Is this a good hashing function for short strings?

No, it's not a good hash for uniqueness. You're basically mapping the string onto a double. For a string that's 50 characters long, you're going to get a value on the order of 256 ^^ 50, which is 2.58e120. That's well within a double's range, which is 1.7e308, but you have to understand that double doesn't represent numbers exactly -- it's only 8 bytes long after all.

Your code maps a string to a double as if the characters were base-256 digits, with the first character being the least significant digit:

The string hello maps like this:

'h' * 256^^0 + 'e'*256^^1 + 'l' * 256^^2 + 'l' * 256^^3 + 'o' * 256^^4

For a string larger than a few bytes, the last characters will be by far the most significant part of the result, and all the other characters will get dropped entirely because a double doesn't have the precision to represent all those bits.

The end result is that your hash function will only consider the last few characters. A good hash function should change whenever any of the characters in the string change, so that strings that are similar but not exactly the same are extremely unlikely to have the same hash value. With your function, hash values are likely to be the same as long as the last several characters are the same.

Caleb
  • 124,013
  • 19
  • 183
  • 272
  • 1
    can you explain more? – apple apple Sep 26 '18 at 13:15
  • Sorry, folks, I hit the submit button well before I was done answering. – Caleb Sep 26 '18 at 13:26
  • 1
    FWIW, in those cases, delete the answer, edit the rest in, and then undelete. – NathanOliver Sep 26 '18 at 13:27
  • 1
    @huseyintugrulbuyukisik From [this answer](https://stackoverflow.com/a/13543600/643383), a double has 53 bits of precision. 53/8=6.625, so only about the last 6 and a half characters in your string will matter. But don't let me stop you -- you'll learn a lot more if you actually try it. Build a simple version of your function (don't worry about the lookup tables etc) and try hashing different equal-length strings where the last 8 or 10 characters are the same. – Caleb Sep 26 '18 at 13:43
  • 1
    Another option is to quickly edit the answer you accidentally posted and add <>, save, then do your edit. – Gem Taylor Sep 26 '18 at 14:12