
I am working on a SQL Server 2008 database and an ASP.NET MVC e-commerce web app.

I have different users feeding their products into the DB, and I want to compare the prices of products with similar names. I know that string matching is domain-specific, but I still need the best generic solution.

What is the most efficient way to group the search results? Should I compare each pair of records using the Levenshtein distance algorithm? Should I do it in the DB, or in the code? Is there a way to run SSIS Fuzzy Grouping in real time for this task? Is there an efficient way to do it using SQL Server 2008 full-text search?
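For reference, here is a minimal sketch of the Levenshtein distance algorithm in Python (the T-SQL links in the comments show a database-side version of the same dynamic-programming idea):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic DP edit distance: insertions, deletions, substitutions all cost 1."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string in the outer loop
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution (free on match)
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

It runs in O(len(a) × len(b)) time, so comparing every pair of N product names is O(N²) distance computations, which is the scalability concern behind the question.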

Edit 1: What about network-graph analysis? If I define a distance matrix using the Levenshtein algorithm, I could use a clustering algorithm (for example, Clauset-Newman-Moore) and separate groups that have no connecting path between them. I have attached Nick Johnson's cat/dog example (see the comments) for illustration (the red lines are the clusters): using Clauset-Newman-Moore I create two different clusters and separate the cats from the dogs.

What do you think?

(image: word graph with edit-distance edges; the red lines mark the two clusters)
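To see Nick Johnson's objection concretely, here is a small plain-Python sketch (his word list from the comments, a hypothetical choice of threshold) that links words whose edit distance is 1 and then takes connected components via union-find. The chain cat-car-bar-bag-bog-dog collapses everything into a single component, which is exactly why a modularity-based split such as Clauset-Newman-Moore would be needed on top of the raw graph:

```python
from itertools import combinations

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

words = ["cat", "car", "bar", "bag", "bog", "dog"]  # Nick Johnson's example

# Union-find: merge any two words whose edit distance is <= 1.
parent = {w: w for w in words}
def find(w):
    while parent[w] != w:
        parent[w] = parent[parent[w]]  # path compression
        w = parent[w]
    return w

for a, b in combinations(words, 2):
    if levenshtein(a, b) <= 1:
        parent[find(a)] = find(b)

clusters = {}
for w in words:
    clusters.setdefault(find(w), []).append(w)
print(list(clusters.values()))  # → one component containing all six words
```

A connected-components cut can never separate the cats from the dogs here; only an algorithm that penalizes the thin bridge between the two dense ends of the chain can.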

Gidon
  • I would do it in the DB, see this thread: http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=66781 and this: http://stackoverflow.com/questions/560709/levenshtein-distance-in-t-sql on the Levenshtein distance alg. – Magnus Mar 29 '12 at 10:10
  • This is tough - how would you group the products 'cat', 'car', 'bar', 'bag', 'bog', 'dog'? Each is only distance 1 from each other, but 'cat' and 'dog' share no similarities. – Nick Johnson Mar 29 '12 at 11:41
  • So what is the alternative? Maybe some kind of semantic dictionary? any other ideas? – Gidon Mar 29 '12 at 22:34
  • @NickJohnson: Well... *cat* and *car* have distance 1. *car* and *bar* have distance 1 too. **But** this says that *cat* and *bar* have distance 2 and not 1. You have to make two hops from *cat* to *bar*, don't you? And 5 from *cat* to *dog*. So they are quite far apart. Although adding other words to the graph could end up with *cat* and *dog* being only 3 steps apart... – Robert Koritnik Mar 30 '12 at 14:36
  • @RobertKoritnik So what clusters would you separate that set of words into, and why? (Also, note the edit distance from 'cat' to 'dog' is 3. :)) – Nick Johnson Mar 30 '12 at 15:25
  • @NickJohnson: I suppose that heavily depends on the business problem at hand. The same is true about distances. Per letter differentiation may not be best distance calculation. Because *cat* and *car* should be further apart than *cat* and *kitty*... But from the perspective of the letters it seems the other way around. So this *graphing thing* may not work as expected. At least not in this way. – Robert Koritnik Mar 30 '12 at 15:30
  • @RobertKoritnik It doesn't, really - my point is that string distances follow a metric space (you can't plot them in a space of _any_ fixed dimensionality), and aren't really susceptible to useful clustering. – Nick Johnson Mar 30 '12 at 15:32

2 Answers


This is a clustering problem and hence computationally difficult, but a large number of algorithms are known for solving such problems, both exactly and approximately. Have a look at the Wikipedia page on Cluster Analysis and this answer.

Once you have a clustering algorithm implemented you could store the clusters in the DB, but I suspect it would be too expensive to re-compute the clusters on every item added. It would probably be best to run the clustering algorithm once an hour or once a day.
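As a sketch of what such a periodic batch job might do, here is a simple greedy "leader" grouping over normalized edit distance (the product names and the 0.3 threshold are hypothetical; a real run would pull names from the DB and write cluster IDs back):

```python
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similar(a, b, threshold=0.3):
    """Normalize by length so long names tolerate proportionally more edits."""
    a, b = a.lower(), b.lower()
    return levenshtein(a, b) / max(len(a), len(b)) <= threshold

names = ["Acme Widget 2000", "ACME widget 2000", "Acme Widget 3000", "Banana Slicer"]

clusters = []
for name in names:
    for cluster in clusters:
        if similar(name, cluster[0]):  # compare against the cluster's "leader"
            cluster.append(name)
            break
    else:
        clusters.append([name])  # no leader matched: start a new cluster
print(clusters)
```

This is O(N × clusters) rather than full O(N²), and order-dependent, so it is a stand-in for a proper clustering algorithm, but it shows the shape of an hourly/nightly job: compute groups offline, store them, and serve them from the DB between runs.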

Vic Smith

If you can get hold of a suitable thesaurus/ontology, that basically provides the best clustering possible: words are leaves in a concept tree, and distance in the tree is the distance between words in a semantic sense. Thus cat and dog aren't nearly as close as tabby and calico (two cats), but they're substantially closer than cat and banana, which in turn are closer than cat (n.) and jump (v.).

Allowing for small spelling errors (by mapping words that aren't in the thesaurus to similarly spelled words that are) could increase robustness, but it could also create unexpected results because of homonyms.
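The tree-distance idea can be sketched with a toy ontology (the concept tree below is entirely made up for illustration; a real system would use WordNet or a product taxonomy). Distance is the number of edges from one word up to the lowest common ancestor and back down to the other:

```python
# Hypothetical concept tree: child -> parent.
parent = {
    "tabby": "cat", "calico": "cat",
    "cat": "feline", "dog": "canine",
    "feline": "carnivore", "canine": "carnivore",
    "carnivore": "mammal", "mammal": "animal",
    "animal": "organism",
    "banana": "fruit", "fruit": "plant", "plant": "organism",
}

def ancestors(word):
    """Path from the word up to the root, inclusive."""
    path = [word]
    while word in parent:
        word = parent[word]
        path.append(word)
    return path

def tree_distance(a, b):
    """Edges from a to b through their lowest common ancestor."""
    pa, pb = ancestors(a), ancestors(b)
    for i, node in enumerate(pa):
        if node in pb:
            return i + pb.index(node)
    return None  # disjoint trees: no common ancestor

print(tree_distance("tabby", "calico"))  # → 2 (siblings under "cat")
print(tree_distance("cat", "dog"))       # → 4 (via "carnivore")
print(tree_distance("cat", "banana"))    # → 8 (all the way up to "organism")
```

Note how the ordering matches the answer's intuition: tabby/calico are closest, cat/dog are further apart, and cat/banana are further still, regardless of how the words are spelled.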

As for doing it in the database or in code, do it in code. To the extent that you can cache, that will be faster.

DRVic