
How can I detect small differences between two strings with the MD5 algorithm? I want to find the percentage of similarity between a few large strings. How can I measure the difference, given that changing a single character produces a completely different hash:

MD5("The quick brown fox jumps over the lazy dog.")
= e4d909c290d0fb1ca068ffaddf22cbd0

MD5("The quick brown fox jumps over the lazy dog")
= 9e107d9d372bb6826bd81d3542a419d6

Can you suggest a solution to this, or another algorithm that works effectively on large strings or large documents?

Rayshawn
george mano
    Finding things that are *similar* is not the job of MD5 or any hash function. All good hash functions intentionally magnify small differences, since their goal is to reduce collisions. What you want is a metric often called "edit distance", meaning the number of individual edits it would take to turn one string into another. – Daniel Pryden Nov 03 '12 at 05:23

2 Answers


A hash can only tell you whether two strings match exactly or not. This question has been asked before, in "How much two strings are similar?(90%,100%,40%)", where the answers recommend the Levenshtein distance. This article explains how to compute the Levenshtein distance and derive a percentage difference from it: http://www.switchplane.com/blog/improving-search-with-levenshtein-distance.php
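As a rough illustration of the idea, here is a minimal Python sketch of the Levenshtein distance with a percentage derived from it. The function names are made up for this example, and dividing by the length of the longer string is just one common convention; the linked article may normalize differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Hypothetical helper: 100 * (1 - distance / length of longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are identical
    return 100.0 * (1 - levenshtein(a, b) / longest)


with_dot = "The quick brown fox jumps over the lazy dog."
without_dot = "The quick brown fox jumps over the lazy dog"
print(levenshtein(with_dot, without_dot))                  # 1
print(f"{similarity_percent(with_dot, without_dot):.1f}")  # 97.7
```

Note that this runs in time proportional to the product of the two lengths, so for very large documents it can get slow.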

Community
gview

If the strings are really long (like entire, possibly large, files) you can break them up into pieces, hash the pieces, and check how many match. That's not entirely dependable though.

If it reports that most of the two strings are identical, that is probably accurate. The reverse does not hold: unless you do quite a bit more work to keep the pieces synchronized, it can report large differences when the two strings are nearly identical. For example, with a naive implementation, inserting a single byte at the beginning of one string shifts every piece boundary, so the strings can appear entirely different even though only one byte actually differs.
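To make the failure mode concrete, here is a small Python sketch of the naive version: fixed-size pieces, MD5 of each piece, compared position by position. The function names and the chunk size of 8 bytes are arbitrary choices for illustration, not a recommended design.

```python
import hashlib


def chunk_hashes(text: str, chunk_size: int = 8) -> list:
    """MD5 hex digest of each fixed-size piece of the UTF-8 encoded text."""
    data = text.encode("utf-8")
    return [hashlib.md5(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]


def matching_fraction(a: str, b: str, chunk_size: int = 8) -> float:
    """Fraction of pieces whose hashes agree at the same position."""
    ha, hb = chunk_hashes(a, chunk_size), chunk_hashes(b, chunk_size)
    total = max(len(ha), len(hb))
    if total == 0:
        return 1.0  # both inputs are empty
    same = sum(x == y for x, y in zip(ha, hb))
    return same / total


base = "abcdefghijklmnopqrstuvwxyz0123456789" * 4  # 144 bytes, 18 pieces

changed_at_end = base[:-1] + "!"   # one byte replaced, boundaries unchanged
inserted_at_front = "X" + base     # one byte inserted, every boundary shifts

print(matching_fraction(base, changed_at_end))     # 17 of 18 pieces match
print(matching_fraction(base, inserted_at_front))  # 0.0, nothing lines up
```

Both edits change a single byte, but only the in-place replacement is reported as a small difference.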

Jerry Coffin