1

I have a rather abstract question: usual hashing algorithms (both cryptographic and non-cryptographic) change drastically if input changes even slightly.

Digest::SHA1.hexdigest 'hello'
=> "aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d"
Digest::SHA1.hexdigest 'hello!'
=> "8f7d88e901a5ad3a05d8cc0de93313fd76028f8c"

Are there hash algorithms that don't change the output when input changes slightly?

Ideally such algorithm should have a tolerance setting, which should tell how much of the input changes the hash should tolerate before changing the output.

For example, if input tolerance is 70%, these "hello" and "hello!" strings should produce the same hashed output, but if it's 95%, then these two strings should produce different (slightly) output.

Maybe it's not called hashing at all, but this area is an unknown unknown to me.

Valentin V
  • 24,971
  • 33
  • 103
  • 152

1 Answers1

1

you might look into document comparison algorithms. That's closer to the class of algorithm you need.

see

Text comparison algorithm

From there you can calculate the % that's changed. This would require that you keep the original text to compare with--not a small, hashed value. But I don't see any possible way that storing a small value, like a hash, is going to be useful to you.

user2735420
  • 106
  • 5
  • I agree. It seems logically impossible to define a non-trivial hash that is resistant to small but arbitrary change. if you can even change a single arbitrary character without affecting the hash by induction you change all the characters without affecting the hash. So the hash is trivial (a constant) for all messages. You can devise hashes that are invariant for certain transformations (say change of case or punctuation) the normal way to achieve that is to 'wash' the input (e.g. up case it or remove punctuation) before using a general hash. – Persixty Mar 06 '15 at 10:58