When does hashing work?
Hashing reduces the search space so that equivalent items can be found more quickly. It works whenever there is a reliable way to produce a single canonical value for all members of an equivalence class.
Selecting a unique value among equivalent strings
Before hashing, the strings need to be converted to a canonical value (one unique representation among all equivalent strings).
I'm aware that even a single whitespace can change the value of a
hash, that's ok with me.
For your application, here is a possible canonicalizing function that just removes whitespace:
>>> def canonical(s):
...     return ''.join([c for c in s if not c.isspace()])
>>> s = 'the quick\nbrown\tfox jumped'
>>> t = ' the\tquick brown fox jumped'
>>> canonical(s)
'thequickbrownfoxjumped'
>>> canonical(t)
'thequickbrownfoxjumped'
Applying a hash function
sha256() is fast, and the chance of a false positive (two different canonical strings producing the same digest) is negligible.
In Python 2, you can compute the sha256 directly from a string. However, in Python 3, the string must first be encoded into bytes:
>>> from hashlib import sha256
>>> sha256(canonical(s).encode()).hexdigest()
'2c31c202821431b015cb800ab6315289884e87f1ed023abc876915685c620919'
>>> sha256(canonical(t).encode()).hexdigest()
'2c31c202821431b015cb800ab6315289884e87f1ed023abc876915685c620919'
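To actually find the equivalent strings in a collection, one option is to use the digest as a dictionary key. This is only a minimal sketch: `group_by_hash` is a hypothetical helper name, and it assumes the whitespace-stripping `canonical()` shown earlier is the right notion of equivalence for your data.

```python
from collections import defaultdict
from hashlib import sha256

def canonical(s):
    # Same whitespace-stripping canonicalizer as in the example above.
    return ''.join(c for c in s if not c.isspace())

def group_by_hash(strings):
    # Hypothetical helper: map each digest to the original strings sharing it.
    groups = defaultdict(list)
    for s in strings:
        digest = sha256(canonical(s).encode()).hexdigest()
        groups[digest].append(s)
    return list(groups.values())

texts = ['the quick\nbrown\tfox jumped',
         ' the\tquick brown fox jumped',
         'a different sentence']
groups = group_by_hash(texts)
# The first two strings land in one group; the third is on its own.
```

Each string is hashed once, so grouping n strings takes n hash computations rather than comparing every pair.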
When won't hashing work?
If you just want to group by text similarity, hashing doesn't work as well because there isn't a straightforward way to choose a representative element and because similarity isn't a transitive relation (a is close to b and b is close to c doesn't imply that a is close to c).
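Here is a small illustration of the transitivity problem. It assumes, purely for the example, that "close" means a Hamming distance of at most 1 between equal-length strings; `hamming` and `is_close` are made-up names, and a real similarity measure would likely be something else.

```python
def hamming(a, b):
    # Number of positions at which two equal-length strings differ.
    if len(a) != len(b):
        raise ValueError('strings must have equal length')
    return sum(x != y for x, y in zip(a, b))

def is_close(a, b, threshold=1):
    # Assumed similarity rule for this sketch: differ in at most one position.
    return hamming(a, b) <= threshold

a, b, c = 'cat', 'cot', 'dot'
print(is_close(a, b))  # True:  'cat' vs 'cot' differ in one position
print(is_close(b, c))  # True:  'cot' vs 'dot' differ in one position
print(is_close(a, c))  # False: 'cat' vs 'dot' differ in two positions
```

Since 'cot' is close to both of the others while they are not close to each other, no single canonical value can represent all three consistently, which is what a hash-based approach would require.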