
I have 200,000 strings and need to find the similar strings within that set. I expect the number of similar strings in the set to be very low. Please suggest an efficient data structure.

I could use a simple hash if I were looking for exact matches, but 'similarity' is custom defined in my case: two strings are treated as similar if 80% of the characters in them are the same; order does not matter.

I don't want to call the "similarity" function ~(200k*100k) times. Any suggestions, such as techniques to preprocess the strings or efficient data structures, are welcome. Thanks.

syam
  • What are the lengths of the strings, average versus minimum and maximum? You could make a histogram, where each string has a histogram rank, and then you can simply group them together by a ratio, in this case 80%. – Inbar Rose Dec 27 '12 at 08:59
  • `'cat'` has 100% similarity with `'abracatabra'` in your description, but `'abracatabra'` has less than 80% with `'cat'`. Is that correct? My point is that the "similarity" is not very well defined. – ypercubeᵀᴹ Dec 27 '12 at 09:05
  • The first thing I thought of was not comparing at all. How about a hash function? I did a little search on Stack Overflow: http://stackoverflow.com/questions/8848991/python-digest-hash-for-string-similarity – CppLearner Dec 27 '12 at 09:10
  • Thanks for the replies. I want to use a better similarity function, like the "Jaro" distance ratio. I have a function/library for computing the Jaro distance ratio; if the ratio is greater than 85%, I treat it as a match. At the same time, I don't want to call the function 200k*200k times since it takes too much runtime. – syam Dec 27 '12 at 09:40
  • I take it you're referring to the [Jaro-Winkler distance](http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance)? Do you have to use it, or are you open to other alternatives? – Alex L Dec 27 '12 at 09:58
  • Just as a note, I'm assuming you aren't going to compare two strings more than once, so you would only have to call the similarity function 200k * 100k times. I assume that this is still too much though. – Sean Linehan Dec 27 '12 at 10:01
  • If you are going to compare each string using this function, you can use `for str1, str2 in itertools.combinations(string_list, 2): match = jaro(str1, str2) > 0.85` – Alex L Dec 27 '12 at 10:06
  • I am fine with any decent string similarity algorithm, but my main concern is the runtime. You are right, Sean; in the worst case the number of comparisons is n*(n-1)/2. – syam Dec 27 '12 at 10:09
  • Perhaps you could use [suffix tree clustering](http://cs.nyu.edu/courses/fall02/G22.3033-008/lec4.html), or at least explore related algorithms? – Alex L Dec 27 '12 at 10:27
  • This question might also help - [Good Python modules for fuzzy string comparison?](http://stackoverflow.com/q/682367/1129194) – Alex L Dec 27 '12 at 10:32
  • http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison helps with understanding fuzzy-matching libraries, but not with reducing the number of comparisons. "Suffix tree clustering" may help me; let me read it. Thanks. – syam Dec 27 '12 at 10:34

1 Answer


I learned that a distance ratio of >= 0.85 is possible only if the length difference between the two strings is <= 3. That means we can group the strings by length and only compare strings whose lengths differ by <= 3.

This drastically reduces the number of strings in each group, so the overall number of comparisons drops to slightly less than 50% (of 200k*100k) on my data set. Moreover, dividing the data set into multiple small sets allows parallel processing, which further reduces the overall runtime.

The reduction percentage will vary with the sample data set; the worst case occurs when all the strings are within a length difference of <= 3 of each other, so no comparisons are pruned.

[Thanks to Inbar Rose for stimulating this thought]
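Here is a minimal sketch of this bucketing approach. It assumes a `jaro(a, b)` function returning a similarity ratio in [0, 1] is available (for example from a fuzzy-matching library); the function is passed in as a parameter rather than hard-coded, since any decent similarity measure works:

```python
from collections import defaultdict

def find_similar(strings, jaro, threshold=0.85, max_len_diff=3):
    """Return pairs of strings whose similarity ratio is >= threshold,
    comparing only strings whose lengths differ by at most max_len_diff."""
    # Bucket the strings by length.
    buckets = defaultdict(list)
    for s in strings:
        buckets[len(s)].append(s)

    matches = []
    for length in sorted(buckets):
        group = buckets[length]
        # Candidates are the current bucket plus the buckets that are up to
        # max_len_diff characters longer; shorter buckets were already paired
        # with this one in an earlier iteration.
        candidates = list(group)
        for longer in range(length + 1, length + max_len_diff + 1):
            candidates.extend(buckets.get(longer, []))

        # Compare each string of the current bucket against the remaining
        # candidates, so every eligible pair is checked exactly once.
        for i, s1 in enumerate(group):
            for s2 in candidates[i + 1:]:
                if jaro(s1, s2) >= threshold:
                    matches.append((s1, s2))
    return matches
```

Each length bucket can also be processed as an independent task, which is where the parallel-processing gain mentioned above comes from.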

In my case, the histogram looked as below:

[histogram of string lengths]
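For reference, such a length histogram can be computed in a couple of lines (a sketch; `strings` stands for the full input list):

```python
from collections import Counter

# Count how many strings fall at each length; this is the histogram used
# above to decide how the length buckets will be populated.
length_histogram = Counter(len(s) for s in strings)
for length in sorted(length_histogram):
    print(length, length_histogram[length])
```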

syam