
I have 200,000 strings and need to find the similar strings within that set. I expect the number of similar strings in the set to be very low. Please suggest an efficient data structure.

I could use a simple hash if I were looking for exact matches, but 'similarity' is custom defined in my case: two strings are treated as similar if 80% of the characters in them are the same; order does not matter.

I don't want to call the "similarity" function ~(200k*100k) times. Any suggestions, such as techniques to preprocess the strings or efficient data structures, are welcome. Thanks.

syam
  • What are the lengths of the strings, average versus minimum and maximum? You could make a histogram, where each string has a histogram rank, and then you can simply group them together by a ratio, in this case 80%. – Inbar Rose Dec 27 '12 at 08:59
  • `'cat'` has 100% similarity with `'abracatabra'` in your description, but `'abracatabra'` has less than 80% with `'cat'`. Is that correct? My point is that the "similarity" is not very well defined. – ypercubeᵀᴹ Dec 27 '12 at 09:05
  • The first thing I thought of was not comparing at all. How about a hash function? I did a little search on Stack Overflow: http://stackoverflow.com/questions/8848991/python-digest-hash-for-string-similarity – CppLearner Dec 27 '12 at 09:10
  • Thanks for the replies. I want to use a better similarity function, like the "Jaro" distance ratio. I have a function/library for computing the Jaro distance ratio; if the ratio is greater than 85%, I treat it as a match. At the same time, I don't want to call the function 200k*200k times since it takes too much runtime. – syam Dec 27 '12 at 09:40
  • I take it you're referring to the [Jaro-Winkler distance](http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance)? Do you have to use it, or are you open to other alternatives? – Alex L Dec 27 '12 at 09:58
  • Just as a note, I'm assuming you aren't going to compare two strings more than once, so you would only have to call the similarity function 200k * 100k times. I assume that this is still too much though. – Sean Linehan Dec 27 '12 at 10:01
  • If you are going to compare each string using this function, you can use `for str1, str2 in itertools.combinations(string_list, 2): match = jaro(str1, str2) > 0.85` – Alex L Dec 27 '12 at 10:06
  • I am fine with any decent string similarity algorithm, but my main concern is the runtime. You are right, Sean; in the worst case the number of comparisons is n*(n-1)/2. – syam Dec 27 '12 at 10:09
  • Perhaps you could use [suffix tree clustering](http://cs.nyu.edu/courses/fall02/G22.3033-008/lec4.html), or at least explore related algorithms? – Alex L Dec 27 '12 at 10:27
  • This question might also help - [Good Python modules for fuzzy string comparison?](http://stackoverflow.com/q/682367/1129194) – Alex L Dec 27 '12 at 10:32
  • http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison helps with understanding fuzzy-matching libraries, but not with reducing the number of comparisons. "Suffix tree clustering" may help me; let me read it. Thanks. – syam Dec 27 '12 at 10:34

1 Answer


I learned that a distance ratio of >= 0.85 is possible only if the length difference between the two strings is <= 3. That means we can group the strings by length and only compare strings whose lengths differ by <= 3.

This drastically reduces the number of strings in each group, so the overall number of comparisons drops to slightly less than 50% (of 200k*100k) on my data set. Moreover, dividing the data set into multiple small sets allows parallel processing, which further reduces the overall runtime.

The reduction percentage will vary with the sample data set; the worst case occurs when all the strings are within a length difference of <= 3 of each other, so no comparisons are pruned.

[Thanks to Inbar Rose for stimulating this thought]
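Here is a minimal sketch of this bucketing approach. It assumes a `jaro(a, b)` function returning a similarity ratio in [0, 1] is available (for example from a fuzzy-matching library); the function is passed in as a parameter rather than hard-coded, since any decent similarity measure works:

```python
from collections import defaultdict

def find_similar(strings, jaro, threshold=0.85, max_len_diff=3):
    """Return pairs of strings whose similarity ratio is >= threshold,
    comparing only strings whose lengths differ by at most max_len_diff."""
    # Bucket the strings by length.
    buckets = defaultdict(list)
    for s in strings:
        buckets[len(s)].append(s)

    matches = []
    for length in sorted(buckets):
        group = buckets[length]
        # Candidates are the current bucket plus the buckets that are up to
        # max_len_diff characters longer; shorter buckets were already paired
        # with this one in an earlier iteration.
        candidates = list(group)
        for longer in range(length + 1, length + max_len_diff + 1):
            candidates.extend(buckets.get(longer, []))

        # Compare each string of the current bucket against the remaining
        # candidates, so every eligible pair is checked exactly once.
        for i, s1 in enumerate(group):
            for s2 in candidates[i + 1:]:
                if jaro(s1, s2) >= threshold:
                    matches.append((s1, s2))
    return matches
```

Each length bucket can also be processed as an independent task, which is where the parallel-processing gain mentioned above comes from.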

In my case, the histogram looked as below:

[histogram of string lengths]
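For reference, such a length histogram can be computed in a couple of lines (a sketch; `strings` stands for the full input list):

```python
from collections import Counter

# Count how many strings fall at each length; this is the histogram used
# above to decide how the length buckets will be populated.
length_histogram = Counter(len(s) for s in strings)
for length in sorted(length_histogram):
    print(length, length_histogram[length])
```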

syam