36

I've had some success comparing strings using the PHP levenshtein function.

However, for two strings which contain substrings that have swapped positions, the algorithm counts those as whole new substrings.

For example:

levenshtein("The quick brown fox", "brown quick The fox"); // 10 differences

are treated as having less in common than:

levenshtein("The quick brown fox", "The quiet swine flu"); // 9 differences

I'd prefer an algorithm which saw that the first two were more similar.

How could I go about coming up with a comparison function that can identify substrings which have switched position as being distinct to edits?

One possible approach I've thought of is to put all the words in the string into alphabetical order, before the comparison. That takes the original order of the words completely out of the comparison. A downside to this, however, is that changing just the first letter of a word can create a much bigger disruption than a changing a single letter should cause.

What I'm trying to achieve is to compare two facts about people which are free text strings, and decide how likely these facts are to indicate the same fact. The facts might be the school someone attended, the name of their employer or publisher, for example. Two records may have the same school spelled differently, words in a different order, extra words, etc, so the matching has to be somewhat fuzzy if we are to make a good guess that they refer to the same school. So-far it is working very well for spelling errors (I am using a phoenetic algorithm similar to metaphone on top of this all) but very poorly if you switch the order of words around which seem common in a school: "xxx college" vs "college of xxx".

csl
  • 10,937
  • 5
  • 57
  • 89
thomasrutter
  • 114,488
  • 30
  • 148
  • 167
  • What is the goal you want to achieve? Levenshtein has a theoretically simple method to tell small differences and intended to recognize for example typos. If your goal is different, you first need to find out a theoretical way to tell the "difference" in your meaning between the two string, then implementation is just matter of craftmanship. – Csaba Kétszeri May 06 '09 at 07:35

9 Answers9

25

N-grams

Use N-grams, which support multiple-character transpositions across the whole text.

The general idea is that you split the two strings in question into all the possible 2-3 character substrings (n-grams) and treat the number of shared n-grams between the two strings as their similarity metric. This can be then normalized by dividing the shared number by the total number of n-grams in the longer string. This is trivial to calculate, but fairly powerful.

For the example sentences:

A. The quick brown fox
B. brown quick The fox
C. The quiet swine flu

A and B share 18 2-grams

A and C share only 8 2-grams

out of 20 total possible.

This has been discussed in more detail in the Gravano et al. paper.

tf-idf and cosine similarity

A not so trivial alternative, but grounded in information theory would be to use term term frequency–inverse document frequency (tf-idf) to weigh the tokens, construct sentence vectors and then use cosine similarity as the similarity metric.

The algorithm is:

  1. Calculate 2-character token frequencies (tf) per sentence.
  2. Calculate inverse sentence frequencies (idf), which is a logarithm of a quotient of the number of all sentences in the corpus (in this case 3) divided by the number of times a particular token appears across all sentences. In this case th is in all sentences so it has zero information content (log(3/3)=0). idf formula
  3. Produce the tf-idf matrix by multiplying corresponding cells in the tf and idf tables. tfidf
  4. Finally, calculate cosine similarity matrix for all sentence pairs, where A and B are weights from the tf-idf table for the corresponding tokens. The range is from 0 (not similar) to 1 (equal).
    cosine similarity
    similarity matrix

Levenshtein modifications and Metaphone

Regarding other answers. Damerau–Levenshtein modificication supports only the transposition of two adjacent characters. Metaphone was designed to match words that sound the same and not for similarity matching.

Tomasz
  • 5,269
  • 8
  • 56
  • 65
  • 1
    Can we do a mixture of both? divide the terms into bigrams then find the cosine similarity ? – jxn Jun 11 '15 at 00:19
  • 1
    @Jenn excellent question and the answer is yes, see http://ii.nlm.nih.gov/MTI/Details/trigram.shtml – Tomasz Jun 12 '15 at 02:13
  • 1
    Thank you for telling me about n-grams back when you wrote this. I have used n-grams on words (rather than individual characters) on a number of different projects since this. – thomasrutter Feb 11 '19 at 23:06
10

Its easy. Just use the Damerau-Levenshtein distance on the words instead of letters.

Unknown
  • 45,913
  • 27
  • 138
  • 182
  • Do you mean: "for every word in A, find the levenshtein distance to every word in B, then add up your results"? – thomasrutter May 06 '09 at 05:29
  • 2
    No, I mean turn every word into a symbol: ie The = a, quick = b, brown = c, etc. And then run the levenshtein algorithm on that. – Unknown May 06 '09 at 05:34
  • No I see what you mean, you mean implement the levenshtein algorithm which compares words rather than letters. Unfortunately this still not work for me, as two words which swap position with each other would still count the same as deleting a word and creating an entirely different word. – thomasrutter May 06 '09 at 05:34
  • Ie levenshtein("abcd", "cbad") is still no more similar than levenshtein("abcd", "abxy") – thomasrutter May 06 '09 at 05:36
  • 2
    Then you might look at similar algorithms like http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance – Unknown May 06 '09 at 05:39
  • I like the sound of this Damerau-Levenshtein distance (with transpositions). Only thing I'm worried about now is how much slower it's going to be implementing it in the PHP code. Thanks for the tip! – thomasrutter May 06 '09 at 05:52
  • An actual example of the "easy" distance on words instead of letters would be awesome. – Nancy May 20 '22 at 15:00
6

Explode on spaces, sort the array, implode, then do the Levenshtein.

rooskie
  • 482
  • 4
  • 8
3

You can also try this. (just an extra suggestion)

$one = metaphone("The quick brown fox"); // 0KKBRNFKS
$two = metaphone("brown quick The fox"); // BRNKK0FKS
$three = metaphone("The quiet swine flu"); // 0KTSWNFL

similar_text($one, $two, $percent1); // 66.666666666667
similar_text($one, $three, $percent2); // 47.058823529412
similar_text($two, $three, $percent3); // 23.529411764706

This will show that the 1st and 2nd are more similar than one and three and two and three.

Ólafur Waage
  • 68,817
  • 22
  • 142
  • 198
  • I think this improvement in score would be more from the use of similar_text rather than from metaphone. I'm currently using a phoenetic algorithm very similar to metaphone. I haven't looked much into the algorithm similar_text uses. I was under the impression it was a lot less efficient than levenshtein, but I guess you get what you pay for. I might try it. – thomasrutter May 06 '09 at 12:21
  • I tried with only similar text and it gave a much lower score and a lower score between one and two, than one and three. – Ólafur Waage May 06 '09 at 13:02
3

I've been implementing levenshtein in a spell checker.

What you're asking for is counting transpositions as 1 edit.

This is easy if you only wish to count transpositions of one word away. However for transposition of words 2 or more away, the addition to the algorithm is worst case scenario !(max(wordorder1.length(), wordorder2.length())). Adding a non-linear subalgorithm to an already quadratic algorithm is not a good idea.

This is how it would work.

if (wordorder1[n] == wordorder2[n-1])
{
  min(workarray[x-1, y] + 1, workarray[x, y-1] + 1, workarray[x-2, y-2]);
}
  else
{
  min(workarray[x-1, y] + 1, workarray[x, y-1] + 1);
}

JUST for touching transpositions. If you want all transpositions, you'd have to for every position work backwards from that point comparing

1[n] == 2[n-2].... 1[n] == 2[0]....

So you see why they don't include this in the standard method.

Juha Syrjälä
  • 33,425
  • 31
  • 131
  • 183
1

i believe this is a prime example for using a vector-space search engine.

in this technique, each document essentially becomes a vector with as many dimensions as there are different words in the entire corpus; similar documents then occupy neighboring areas in that vector space. one nice property of this model is that queries are also just documents: to answer a query, you simply calculate their position in vector space, and your results are the closest documents you can find. i am sure there are get-and-go solutions for PHP out there.

to fuzzify results from vector space, you could consider to do stemming / similar natural language processing technique, and use levenshtein to construct secondary queries for similar words that occur in your overall vocabulary.

flow
  • 3,624
  • 36
  • 48
1

Take this answer and make the following change:

void match(trie t, char* w, string s, int budget){
  if (budget < 0) return;
  if (*w=='\0') print s;
  foreach (char c, subtrie t1 in t){
    /* try matching or replacing c */
    match(t1, w+1, s+c, (*w==c ? budget : budget-1));
    /* try deleting c */
    match(t1, w, s, budget-1);
  }
  /* try inserting *w */
  match(t, w+1, s + *w, budget-1);
  /* TRY SWAPPING FIRST TWO CHARACTERS */
  if (w[1]){
    swap(w[0], w[1]);
    match(t, w, s, budget-1);
    swap(w[0], w[1]);
  }
}

This is for dictionary search in a trie, but for matching to a single word, it's the same idea. You're doing branch-and-bound, and at any point, you can make any change you like, as long as you give it a cost.

Community
  • 1
  • 1
Mike Dunlavey
  • 40,059
  • 14
  • 91
  • 135
  • This looks like it could be quite useful, though it will take a bit of research on my part to figure out how it works. I haven't used a Trie before, so I'll investigate. – thomasrutter May 07 '09 at 13:13
  • @thomas. You only need the trie if you're searching a dictionary. If you're just comparing two strings (or lists of things), the "foreach" just becomes a simple statement block. Recursive branch-and-bound is a pretty useful Swiss Army knife. – Mike Dunlavey May 07 '09 at 14:25
1

Eliminate duplicate words between the two strings and then use Levenshtein.

JRL
  • 76,767
  • 18
  • 98
  • 146
0

If the first string is A and the second one is B:

  1. Split A and B into words
  2. For every word in A, find the best matching word in B (using levenshtein)
  3. Remove that word from B and put it in B* at the same index as the matching word in A.
  4. Now compare A and B*

Example:

A: The quick brown fox
B: Quick blue fox the
B*: the Quick blue fox

You could improve step 2 by doing it in multiple passes, finding only exact matches at first, then finding close matches for words in A that don't have a companion in B* yet, then less close matches, etc.

Bart van Heukelom
  • 43,244
  • 59
  • 186
  • 301