1

So I was looking at the hash functions, and figured out that given 2 similar strings, even if the differ by a single bit, the result would be a completely different hash key. I actually need to create some sort of unique id, which has this feature of being quite similar for similar input (will be millions of alpha numerical strings).

Example:

  • two equal strings must have the same hash.
  • two different strings must have different hash.
  • two different strings, that are quite similar must have different hashes that at the same time are not too far from each other.

what would be a good approach to achieve that? I am using python.

Danny
  • 344
  • 1
  • 4
  • 15
  • 1
    "two different strings must have different hash." This is impossible unless the hashes are longer than the longest possible string. "two different strings, that are quite similar must have different hashes that at the same time are not too far from each other." If you don't need cryptographic security you could use some reduced-round version of a hash function. – agf Oct 11 '11 at 23:34
  • 4
    use the strings raw and store some form of levenstein distance? – Mitch Wheat Oct 11 '11 at 23:35
  • Can you explain what you're trying to do with this? There might be a better way to achieve your end-result. – beerbajay Oct 11 '11 at 23:36
  • I don't need cryptographic security. When I say they must be different I mean not to have collisions, as far as possible. – Danny Oct 11 '11 at 23:37
  • I need to compare non homogeneous data (numbers, and strings combined), which can be subject to mistakes such as mistyping errors – Danny Oct 11 '11 at 23:39
  • Why do you need to hash it at all? If they are same, the hash is same. If they are slightly different, the hashes are slightly different. If they are completely unrelated, the hashes are completely unrelated, __If hash(myStr)=myStr__, please correct me if I am wrong – Lelouch Lamperouge Oct 11 '11 at 23:46
  • well, "asd" and "asc" differ by one letter, but hash("asd")=1585925417 and hash("asc")=1585925424 so they differ by 2 digits now. 2>1 is a worse result, it's like i figured out the two strings were less similar than before. I need to improve instead and be more accurate – Danny Oct 12 '11 at 00:01

3 Answers3

1

What you're asking for is not possible, assuming by 'similar hash' you mean that the values should be of similar magnitude - eg, 12345 is similar to 12346 but not to 92345. The reason for this is that similarity of that sort is one dimensional (a number line), but the ways in which strings can be similar to each other has no fixed dimension (eg, 'foo', 'fob' and 'fod' all have distance 1 to each other).

If you want to perform fuzzy matching, you will instead need to use a different method of indexing your text, like this or this.

If you just want to compare individual values for similarity, don't hash them in the first place - just compute their edit distance immediately.

Nick Johnson
  • 100,655
  • 16
  • 128
  • 198
0

If you're sure that you always have alphanumeric data than I would recommend using a base 36 (or higher) algorithm.

You can use the method I gave as an answer to this question: Base 62 conversion

import string
BASE_LIST = string.digits + string.letters
BASE_DICT = dict((c, i) for i, c in enumerate(BASE_LIST))

def base_decode(string, reverse_base=BASE_DICT):
    length = len(reverse_base)
    ret = 0
    for i, c in enumerate(string[::-1]):
        ret += (length ** i) * reverse_base[c]

    return ret

def base_encode(integer, base=BASE_LIST):
    length = len(base)
    ret = ''
    while integer != 0:
        ret = base[integer % length] + ret
        integer /= length

    return ret

Example usage:

for i in range(100):                                    
    print i, base_decode(base_encode(i)), base_encode(i)
Community
  • 1
  • 1
Wolph
  • 78,177
  • 11
  • 137
  • 148
0

I believe the below satisfies your stated requirements.

def gethash(data):
  u"given a character string return an integer hash value"
  return reduce(lambda b1, b2: (b1 << 8) + b2,
      imap(ord, unicodedata.normalize('NFC', data).encode('UTF-8')))

Essentially the hash value is the complete binary value of the UTF-8 encoded byte values of the input as a single integer. Similar character strings produce hash values with similar bits (not always with a small subtractive difference, but you did not specify that). Normalization causes strings u'A\u030a' and u'\xc5' to have the same hash value.

If you want to limit the maximum value, then simply apply modulo division (by 2^32 maybe) as a final step.

wberry
  • 18,519
  • 8
  • 53
  • 85