58

Is there a method to calculate something like general "similarity score" of a string? In a way that I am not comparing two strings together but rather I get some number (hash) for each string that can later tell me that two strings are or are not similar. Two similar strings should have similar (close) hashes.

Let's consider these strings and scores as an example:

Hello world                1000
Hello world!               1010
Hello earth                1125
Foo bar                    3250
FooBarbar                  3750
Foo Bar!                   3300
Foo world!                 2350

You can see that Hello world! and Hello world are similar and their scores are close to each other.

This way, finding the most similar strings to a given string would be done by subtracting given strings score from other scores and then sorting their absolute value.

Josef Sábl
  • 7,538
  • 9
  • 54
  • 66
  • 4
    What do you mean by 'similar'? Are 'hello world', 'world hello' and 'dlrow olleh' similar? If so, or if not, why? – smirkingman Dec 01 '10 at 15:18
  • 1
    What happens when more than two strings are equidistant from each other? You can't model that with a 1-dimensional score. – mbeckish Dec 01 '10 at 18:05
  • @smirkingman It does not matter, I was thinking mainly about the general similarity score concept. But let's say similar as in Levenshtein. – Josef Sábl Dec 02 '10 at 07:41
  • 1
    Hello, I am also very interested in this problem. Did you get any progress on this problem? – Bloodmoon Jan 13 '14 at 09:05
  • @JosefSábl did you figure this out? I am trying to find something similar and surprised and not complex; but struggling! In Machine Learning there are things like Word2Vec which seams over complex but maybe that is what i should do – Joe Booth Feb 18 '16 at 21:09

12 Answers12

36

I believe what you're looking for is called a Locality Sensitive Hash. Whereas most hash algorithms are designed such that small variations in input cause large changes in output, these hashes attempt the opposite: small changes in input generate proportionally small changes in output.

As others have mentioned, there are inherent issues with forcing a multi-dimensional mapping into a 2-dimensional mapping. It's analogous to creating a flat map of the Earth... you can never accurately represent a sphere on a flat surface. Best you can do is find a LSH that is optimized for whatever feature it is you're using to determine whether strings are "alike".

DougW
  • 28,776
  • 18
  • 79
  • 107
19

Levenstein distance or its derivatives is the algorithm you want. Match given string to each of strings from dictionary. (Here, if you need only fixed number of most similar strings, you may want to use min-heap.) If running Levenstein distance for all strings in dictionary is too expensive, then use some rough algorithm first that will exclude too distant words from list of candidates. After that, run levenstein distance on left candidates.


One way to remove distant words is to index n-grams. Preprocess dictionary by splitting each of words into list of n-grams. For example, consider n=3:

(0) "Hello world" -> ["Hel", "ell", "llo", "lo ", "o w", " wo", "wor", "orl", "rld"]
(1) "FooBarbar" -> ["Foo", "ooB", "oBa", "Bar", "arb", "rba", "bar"]
(2) "Foo world!" -> ["Foo", "oo ", "o w", " wo", "wor", "orl", "rld", "ld!"]

Next, create index of n-gramms:

" wo" -> [0, 2]
"Bar" -> [1]
"Foo" -> [1, 2]
"Hel" -> [0]
"arb" -> [1]
"bar" -> [1]
"ell" -> [0]
"ld!" -> [2]
"llo" -> [0]
"lo " -> [0]
"o w" -> [0, 2]
"oBa" -> [1]
"oo " -> [2]
"ooB" -> [1]
"orl" -> [0, 2]
"rba" -> [1]
"rld" -> [0, 2]
"wor" -> [0, 2]

When you need to find most similar strings for given string, you split given string into n-grams and select only those words from dictionary which have at least one matching n-gram. This reduces number of candidates to reasonable amount and you may proceed with levenstein-matching given string to each of left candidates.


If your strings are long enough, you may reduce index size by using min-hashing technnique: you calculate ordinary hash for each of n-grams and use only K smallest hashes, others are thrown away.

P.S. this presentation seems like a good introduction to your problem.

gudok
  • 4,029
  • 2
  • 20
  • 30
13

This isn't possible, in general, because the set of edit distances between strings forms a metric space, but not one with a fixed dimension. That means that you can't provide a mapping between strings and integers that preserves a distance measure between them.

For example, you cannot assign numbers to these three phrases:

  • one two
  • one six
  • two six

Such that the numbers reflect the difference between all three phrases.

Nick Johnson
  • 100,655
  • 16
  • 128
  • 198
  • 4
    I'm going to get a little Information Theory-y here and argue that you have actually just done what you claim is impossible. Each of those strings can be represented as a binary number (ie integer), and you have just proven that you're able to identify structure in that number that describes what you call "difference". I think the question really being asked is, is there a simpler set of numbers we could map strings to that would, without loss, represent the complete set of possible relationships. This is essentially the Kolmogorov complexity of the search space. – DougW Jan 18 '12 at 19:43
  • 1
    Clearly it's impossible to have a lower Kolmogorov complexity for arbitrary strings with arbitrary definitions of "alike". I would argue that it is, however, entirely possible that you could reduce that complexity by restricting the set or definition (ie, only English language strings). It's possible that the complexity of that question is vastly smaller than the complexity of the unbounded one and, possibly, mappable to a smaller integer space. – DougW Jan 18 '12 at 19:48
  • @DougW I'm afraid I don't follow your argument. Komologorov complexity is irrelevant here; the issue is dimensionality. Can you assign a number to each of those phrases such that the distance between the numbers reflects the edit distance between the strings? – Nick Johnson Jan 19 '12 at 00:32
  • Yeah it's a lot to type into a comment! My point is that "integers" in this case need not be restricted to single dimensional, linear comparisons as you're suggesting. The reductio ad absurdum of your statement is that you yourself have already given "numbers" which contain all the dimensionality necessary to make any comparison you want. However that is the extreme. What if we could map the ENTIRE space of string characteristics we care about to two bits, where each bit represented some relationship dimension? Probably we can't, but what about 3 bits? 4? Where do we draw the line? – DougW Jan 19 '12 at 18:18
  • 1
    This is where the K complexity comes into play. If we consider "likeness" to be "the number of letter As", then the complexity of that space is small enough to map to an easily understood number. As we increase the complexity of our definition of "like", the K complexity of the problem space increases, and the representation becomes more complex as well. We may reduce this complexity by discarding unimportant information. It's very much like representing the Earth on a 2D map. We can map to lower dimensions at the expensive of information and still approximate the info we care about. – DougW Jan 19 '12 at 18:28
  • @DougW There's two problems with what you're suggesting. First, as I've already pointed out, string similarity is a metric space, and doesn't have a fixed dimension. If you try and categorise string differences into a fixed number of dimensions, there will always be a counter-example that doesn't work. – Nick Johnson Jan 19 '12 at 23:16
  • In response to your other assertion, yes, a string is a number, but the operation the OP is specifically asking for is "This way, finding the most similar strings to a given string would be done by subtracting given strings score from other scores and then sorting their absolute value." - that is, integers under subtraction. That requires reducing each string to a single dimension, which isn't possible as I already demonstrated. – Nick Johnson Jan 19 '12 at 23:16
  • Sure, I'm not arguing with your answer for unbounded, arbitrary string comparisons. My point is that you're making an assumption about what "alike" means, so your answer begs the question. One can easily come up with definitions (or a set of definitions) for "likeness" that CAN be mapped to single dimensions, or a small group of them. Would your answer change if I defined "likeness" as "the number of vowels"? The frequency of a given n-gram? We could come up with a small set of these questions that are each measurable and, together, create a signature that achieves the spirit of the question. – DougW Jan 20 '12 at 01:09
  • @DougW Sure, I'm making an assumption, but it's an eminently reasonable one - that the OP meant to use some sort of edit distance measure. And you cannot come up with definitions that can be mapped to a fixed set of dimensions and still preserve edit distance - that's the whole point of saying it's a metric space. – Nick Johnson Jan 20 '12 at 05:27
  • 1
    Alright, clearly that's your interpretation of the question and I respect that. Personally I don't see anything in his question that indicates he cares about "edit distance". He cares about similarity of strings (see his response to mbeckish in the comments), which I believe is a different question and can be at least approximated by hashing (see my own answer). – DougW Jan 20 '12 at 18:59
  • 1
    calculate edit distance, choose: base="one two"; ED(base, "one two") = 0; ED(base, "one six") = 6; ED(base, "two six") = 8; the number doesnt represent the value of the string universally, but it can somewhat determine the order of differences in your given collection – tu4n Apr 14 '16 at 12:47
  • @nmutan You haven't defined `ED("one six", "two six")`. As soon as you do, you'll find that at least one pair has an integer difference that's inconsistent with its edit distance. – Nick Johnson Apr 14 '16 at 12:59
4

While the idea seems extremely sweet... I've never heard of this.

I've read many, many, technics, thesis, and scientific papers on the subject of spell correction / typo correction and the fastest proposals revolve around an index and the levenshtein distance.

There are fairly elaborated technics, the one I am currently working on combines:

  • A Bursted Trie, with level compactness
  • A Levenshtein Automaton

Even though this doesn't mean it is "impossible" to get a score, I somehow think there would not be so much recent researches on string comparisons if such a "scoring" method had proved efficient.

If you ever find such a method, I am extremely interested :)

Matthieu M.
  • 287,565
  • 48
  • 449
  • 722
2

In an unbounded problem, there is no solution which can convert any possible sequence of words, or any possible sequence of characters to a single number which describes locality.

Imagine similarity at the character level

stops
spots

hello world
world hello

In both examples the messages are different, but the characters in the message are identical, so the measure would need to hold a position value , as well as a character value. (char 0 == 'h', char 1 == 'e' ...)

Then compare the following similar messages

hello world
ello world

Although the two strings are similar, they could differ at the beginning, or at the end, which makes scaling by position problematic.

In the case of

spots
stops

The words only differ by position of the characters, so some form of position is important.

If the following strings are similar

 yesssssssssssssss
 yessssssssssssss

Then you have a form of paradox. If you add 2 s characters to the second string, it should share the distance it was from the first string, but it should be distinct. This can be repeated getting progressively longer strings, all of which need to be close to the strings just shorter and longer than them. I can't see how to achieve this.

In general this is treated as a multi-dimensional problem - breaking the string into a vector

[ 'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd' ]

But the values of the vector can not be

  • represented by a fixed size number, or
  • give good quality difference measure.

If the number of words, or length of strings were bounded, then a solution of coding may be possible.

Bounded values

Using something like arithmetic compression, then a sequence of words can be converted into a floating point number which represents the sequence. However this would treat items earlier in the sequence as more significant than the last item in the sequence.

data mining solution

If you accept that the problem is high dimensional, then you can store your strings in a metric-tree wikipedia : metric tree. This would limit your search space, whilst not solving your "single number" solution.

I have code for such at github : clustering

Items which are close together, should be stored together in a part of the tree, but there is really no guarantee. The radius of subtrees is used to prune the search space.

Edit Distance or Levenshtein distance

This is used in a sqlite extension to perform similarity searching, but with no single number solution, it works out how many edits change one string into another. This then results in a score, which shows similarity.

mksteve
  • 12,614
  • 3
  • 28
  • 50
2

Would Levenshtein distance work for you?

Karl Knechtel
  • 62,466
  • 11
  • 102
  • 153
  • 3
    It is exactly what it is: distance. It doesn't give you any characteristic to one string, it can be used only to compare two particular strings. – Nikita Rybak Dec 01 '10 at 12:03
  • Nikita is right, that's the problem. Other than that, it would be just what I need. – Josef Sábl Dec 01 '10 at 13:10
  • 1
    A one-dimensional "characteristic" won't work, because if distance(a,b) = 1 and distance(b,c) = 1, it does not follow that distance(a,c) = 2. What are you **really** trying to do? – Karl Knechtel Dec 01 '10 at 13:46
  • i was looking for something like this but now i doubt that it's possible because it's Levenshtein distance and it's not similar to euclidean distance where dist(a,c) = dist(a,b) +/- dist(b,c). – Alvin Dec 09 '13 at 11:42
  • This works if you choose a base-string and calculate the distance from it to all of your string, checkout my answer – tu4n Apr 14 '16 at 12:26
  • @KarlKnechtel - Even if it was one-dimensional, it could be that `a` and `c` are the same and so `distance(a,c)` would be `0` even though the distance from either of them to `b` is `1`. – ArtOfWarfare Jun 17 '17 at 13:43
1

I think of something like this:

  1. remove all non-word characters
  2. apply soundex
alpha-mouse
  • 4,953
  • 24
  • 36
  • Not bad, but: a) my strings don't contain only words, b) I will get soundexes, ok, but how do I compare if they are similar :-) – Josef Sábl Dec 01 '10 at 13:05
1

Your idea sounds like ontology but applied to whole phrases. The more similar two phrases are, the closer in the graph they are (assuming you're using weighted edges). And vice-versa: non similar phrases are very far from each other.

Another approach, is to use Fourier transform to get sort of the 'index' for a given string (it won't be a single number, but always). You may find little bit more in this paper.

And another idea, that bases on the Levenshtein distance: you may compare n-grams that will give you some similarity index for two given phrases - the more they are similar the value is closer to 1. This may be used to calculate distance in the graph. wrote a paper on this a few years ago, if you'd like I can share it.

Anyways: despite I don't know the exact solution, I'm also interested in what you'll came up with.

Przemek Kryger
  • 687
  • 4
  • 11
1

Maybe use PCA, where the matrix is a list of the differences between the string and a fixed alphabet (à la ABCDEFGHI...). The answer could be simply the length of the principal component.

Just an idea.

ready-to-run PCA in C#

smirkingman
  • 6,167
  • 4
  • 34
  • 47
0

In Natural Language Processing we have a thing call Minimum Edit Distance (also known as Levenshtein Distance)
Its basically defined as the smallest amount of operation needed in order to transform string1 to string2
Operations included Insertion, Deletion, Subsitution, each operation is given a score to which you add to the distance
The idea to solve your problem is to calculate the MED from your chosen string, to all the other string, sort that collection and pick out the n-th first smallest distance string
For example:

{"Hello World", "Hello World!", "Hello Earth"}
Choosing base-string="Hello World"  
Med(base-string, "Hello World!") = 1  
Med(base-string, "Hello Earth") = 8  
1st closest string is "Hello World!"

This have somewhat given a score to each string of your string-collection
C# Implementation (Add-1, Deletion-1, Subsitution-2)

public static int Distance(string s1, string s2)
{
    int[,] matrix = new int[s1.Length + 1, s2.Length + 1];

    for (int i = 0; i <= s1.Length; i++)
        matrix[i, 0] = i;
    for (int i = 0; i <= s2.Length; i++)
        matrix[0, i] = i;

    for (int i = 1; i <= s1.Length; i++)
    {
        for (int j = 1; j <= s2.Length; j++)
        {
            int value1 = matrix[i - 1, j] + 1;
            int value2 = matrix[i, j - 1] + 1;
            int value3 = matrix[i - 1, j - 1] + ((s1[i - 1] == s2[j - 1]) ? 0 : 2);

            matrix[i, j] = Math.Min(value1, Math.Min(value2, value3));
        }
    }

    return matrix[s1.Length, s2.Length];
}

Complexity O(n x m) where n, m is length of each string
More info on Minimum Edit Distance can be found here

tu4n
  • 4,200
  • 6
  • 36
  • 49
0

It is unlikely one can get a rather small number from two phrases that, being compared, provide a relevant indication of the similarity of their initial phrases.
A reason is that the number gives an indication in one dimension, while phrases are evolving in two dimensions, length and intensity.

The number could evolve as well in length as in intensity but I'm not sure it'll help a lot.

In two dimensions, you better look at a matrix, which some properties like the determinant (a kind of derivative of the matrix) could give a rough idea of the phrase trend.

Déjà vu
  • 28,223
  • 6
  • 72
  • 100
-2

Well, you could add up the ascii value of each character and then compare the scores, having a maximum value on which they can differ. This does not guarantee however that they will be similar, for the same reason two different strings can have the same hash value.

You could of course make a more complex function, starting by checking the size of the strings, and then comparing each caracter one by one, again with a maximum difference set up.

Oscar Olim
  • 104
  • 4