
Is there a method to calculate something like a general "similarity score" for a string? That is, rather than comparing two strings directly, each string would get a number/score (hash) that can later tell me whether two strings are or are not similar. Two similar strings should have similar (close) scores/hashes.

Let's consider these strings and scores as an example:

Hello world   → 1000
Hello world!  → 1010
Hello earth   → 1125
Foo bar       → 3250
FooBarbar     → 3750
Foo Bar!      → 3300
Foo world!    → 2350

You can see that Hello world! and Hello world are similar and their scores are close to each other.

This way, finding the strings most similar to a given string would be done by subtracting the given string's score from the other scores and then sorting by the absolute value of the difference.

My end aim is: there will be streaming log messages (only the pure message text), and I want to find the pattern of those messages (some sort of regular-expression-like template). But that can only start once I can bucket similar strings. I emphasize again that I need a number/score (hash) for each string that can later tell me whether two strings are or are not similar.
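Given such scores (however they might be computed, which is the open problem here), finding the most similar strings is just a sort by absolute score difference. A minimal sketch using the hypothetical scores from the example above:

```python
# Hypothetical scores from the example above; how to compute such a
# score in the first place is the open question.
scores = {
    "Hello world": 1000,
    "Hello world!": 1010,
    "Hello earth": 1125,
    "Foo bar": 3250,
    "FooBarbar": 3750,
    "Foo Bar!": 3300,
    "Foo world!": 2350,
}

def most_similar(query, scores):
    # Sort every other string by the absolute difference of its score
    # from the query's score.
    q = scores[query]
    return sorted((s for s in scores if s != query),
                  key=lambda s: abs(scores[s] - q))

print(most_similar("Hello world", scores))
# ['Hello world!', 'Hello earth', 'Foo world!', ...]
```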

Ajay
  • possible duplicate of [String similarity algorithims?](http://stackoverflow.com/questions/3576211/string-similarity-algorithims) (and many other previous questions) – Fred Foo Jul 12 '11 at 14:02
  • @larsmans the solutions in that post deviate from what I am looking for (i.e. they concentrate on string similarity based on comparing strings). For me, the data are huge and streaming, so comparing strings is out of the question. The one way I see to solve it is scoring (maybe a loose kind of hashing) each string so that the scores can later tell me whether two strings are or are not similar. – Ajay Jul 12 '11 at 16:33
  • Hello, I am also very interested in this problem. Did you get any progress on this problem? – Bloodmoon Jan 13 '14 at 09:02
  • @Bloodmoon: I couldn't compute the hash value (integer). I haven't been able to concentrate on research since I have to concentrate on my job. However, I tweaked the Charikar hash algorithm to get it working to some extent. Still, there is a theoretical limitation when the string has only a few words: it is hard to know whether, say, Hello world is more similar to Hello earth, Foo world, or Foo earth. I still need to research better approaches for strings with a good number of words. – Ajay Jan 13 '14 at 11:35
  • Can similarity be defined by a distance threshold of the hash values? Say if the function is f(), and |f("hello world")-f("hello worth")| < t, where t is the threshold, then "hello world" is similar to "hello worth". Otherwise they are not similar. – Bloodmoon Jan 14 '14 at 08:18
  • @Bloodmoon the problem is how to define the threshold. Suppose |f("hello world") - f("hello worth")| = 100 and |f("hello world") - f("fello world")| = 110; does that mean "fello world" is similar to "hello worth", since both are similar to "hello world"? The use case we are trying to solve is automatic clustering. Let me give you another example: User X logged in, User X logged out, User Y logged in, User Y logged out. With short strings it comes down to personal intuition. However, the threshold concept can work for relatively long strings. – Ajay Jan 14 '14 at 08:34
  • Regarding my progress: I couldn't compute an integer simhash per string, but I compute a 128-bit vector and the Hamming distance between strings. Once a cluster contains more than, say, 5 values, compute a representative vector and compare each incoming string against those representative vectors only (which helps with computation). See the Charikar hash. But lots of things remain to be done; I couldn't focus on this solely. Will continue on this... if I get some funding for a graduate degree. – Ajay Jan 14 '14 at 08:52
  • I see. If you are trying to cluster something, you may want to have a look at pattern classification, which can automatically determine which cluster a new incoming string belongs to by using Bayes' formula, based on previous training. This method is not 100% accurate but has very high precision and very wide adoption. Maybe combining pattern classification and Hamming distance will help. – Bloodmoon Jan 15 '14 at 05:28
  • And why cannot you compute the integer simhash but a vector of 128 bits? – Bloodmoon Jan 15 '14 at 05:31
  • Hmm, I need a little time to research this. I have also read some papers by Risto Vaarandi (he has lots of papers on pattern classification). Regarding integer mapping: simply converting the bits to an int didn't help with automatic clustering, so I looked into some alternative ideas. Will continue if I get some funding. Thanks for your suggestions :) – Ajay Jan 18 '14 at 14:43
  • It's been almost seven years. Did you come up with a solution? – raratiru Mar 28 '18 at 17:00

8 Answers


Have a look at locality-sensitive hashing.

The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items).

There's a very good explanation available here together with some sample code.
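As a rough illustration of the idea (a simhash-style sketch in the spirit of Charikar's scheme, not the exact construction from the linked article; the trigram features are an assumption):

```python
import hashlib

def simhash(text, bits=64):
    # Character trigrams as features: a small edit changes only a few of them.
    grams = [text[i:i + 3] for i in range(max(1, len(text) - 2))]
    votes = [0] * bits
    for g in grams:
        h = int(hashlib.md5(g.encode()).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # Each output bit is the majority vote of the feature hashes.
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    # Number of bits in which two fingerprints differ.
    return bin(a ^ b).count("1")

near = hamming(simhash("Hello world"), simhash("Hello world!"))
far = hamming(simhash("Hello world"), simhash("Foo bar"))
# near should come out much smaller than far
```

Similar strings then land at small Hamming distance, so they can be bucketed (e.g. by fingerprint prefix or by banding groups of bits) without pairwise comparison against the whole database.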

Jeff Foster
  • It has been the best concept I have encountered so far. I'll soon read all the relevant documents. Thank you very much – Ajay Jul 13 '11 at 12:42

TL;DR: Python BK-tree

Interesting question. I have limited experience within this field, but since the Levenshtein distance fulfills the triangle inequality, I figured that there must be a way of computing some sort of absolute distance to an origin in order to find strings in the vicinity of each other without performing direct comparisons against all entries in the entire database.

While googling on some terms related to this, I found one particularly interesting thesis: Aspects of Metric Spaces in Computation by Matthew Adam Skala.

On page 26 he discusses similarity measures based on kd-trees and other structures, but concludes:

However, general metric spaces do not provide the geometry required by those techniques. For a general metric space with no other assumptions, it is necessary to use a distance-based approach that indexes points solely on the basis of their distance from each other. Burkhard and Keller [35] offered one of the first such index structures, now known as a BK-tree for their initials, in 1973. In a BK-tree, the metric is assumed to have a few discrete return values, each internal node contains a vantage point, and the subtrees correspond to the different values of the metric.

A blog entry about how BK-trees work can be found here.

In the thesis, Skala goes on to describe other solutions to this problem, including VP-trees and GH-trees. Chapter 6 analyses distances based on the Levenshtein edit distance, and he also presents some other interesting distance metrics for strings.

I also found Foundations of Multidimensional and Metric Data Structures, which seems relevant to your question.

Alexander Torstling

There are several such "scores", but they all depend on how you define similarity.

Daren Thomas
  • soundex might not work here, as my messages contain numeric values and some special characters as well. Regarding Levenshtein distance: I am looking to bucket by hash value rather than compare the strings themselves, which might be slow for streaming data that is huge in volume. Maybe that can be used as a backup idea later on... anyway, thanks for your help :) – Ajay Jul 13 '11 at 14:48

For a fast way of determining string similarity, you might want to use fuzzy hashing.

Abhay Buch

You might want to look at using a BK-Tree. Here is a discussion and a Python implementation.

A BK-Tree stores strings in a tree, sorted by Levenshtein distance to the parent nodes. This is normally used to prune the search space when looking for similar strings, but it seems that this tree would form a natural ordering that could be used to create clusters.
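A minimal sketch of that structure, assuming Levenshtein distance as the metric (names are illustrative):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, kept to one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, dist):
        self.dist, self.root = dist, None   # node = [word, {distance: child}]

    def add(self, word):
        if self.root is None:
            self.root = [word, {}]
            return
        node = self.root
        while True:
            d = self.dist(word, node[0])
            if d in node[1]:
                node = node[1][d]           # descend into the child at distance d
            else:
                node[1][d] = [word, {}]
                return

    def search(self, word, radius):
        # The triangle inequality lets us skip every subtree whose edge
        # distance lies outside [d - radius, d + radius].
        found, stack = [], [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = self.dist(word, node[0])
            if d <= radius:
                found.append((d, node[0]))
            stack.extend(child for dd, child in node[1].items()
                         if d - radius <= dd <= d + radius)
        return sorted(found)

tree = BKTree(levenshtein)
for w in ["Hello world", "Hello world!", "Hello earth", "Foo bar"]:
    tree.add(w)
print(tree.search("Hello world", 2))   # [(0, 'Hello world'), (1, 'Hello world!')]
```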

Ryan Ginstrom

I don't know if you are still into this, but in information theory there is a way to measure how much information a string or chunk of text carries, and you could use that value as a hash to sort your strings. It is called entropy, and Wikipedia has a nice article about it: https://en.wikipedia.org/wiki/Entropy_(information_theory)
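For illustration, a minimal sketch of Shannon entropy over the character-frequency distribution; note it is a weak similarity signal on its own, since very different strings (anagrams, for instance) share the same entropy:

```python
from collections import Counter
from math import log2

def shannon_entropy(s):
    # Entropy in bits per character of the string's character distribution.
    n = len(s)
    return -sum(c / n * log2(c / n) for c in Counter(s).values())

uniform = shannon_entropy("aaaa")        # a single repeated character: 0 bits
mixed = shannon_entropy("Hello world")   # more varied characters: higher entropy
```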

Rafael Rios

You can always use the Levenshtein distance; there is also an existing implementation for it: http://code.google.com/p/pylevenshtein/

But, for simplicity, you can use the built-in difflib module:

>>> import difflib
>>> l = {'Hello Earth', 'Hello World!', 'Foo Bar!', 'Foo world!', 'Foo bar', 'Hello World', 'FooBarbar'}
>>> difflib.get_close_matches("Foo World", l)
['Foo world!', 'Hello World', 'Hello World!']

http://docs.python.org/library/difflib.html#difflib.get_close_matches

utdemir
  • the edit distance method only compares two strings at a time. It's not feasible due to space complexity (a huge amount of data... I think I mentioned streaming data). It's not possible to store all the data and check each one using Levenshtein distance. – Ajay Jul 19 '11 at 03:46

You may be interested in the Hamming distance. The Python function hamming_distance() below computes the Hamming distance between two strings of equal length.

def hamming_distance(s1, s2):
    # Hamming distance is defined only for equal-length strings.
    assert len(s1) == len(s2)
    # Count the positions at which the characters differ.
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
Wolfwyrd