
I have an input table like this:

In [182]: data_set
Out[182]: 
       name             ID
0  stackoverflow       123      
1  stikoverflow        322      
2  stack, overflow     411      
3  internet.com        531      
4  internet            112      
5  football            001

I want to group similar strings using fuzzywuzzy. After fuzzy matching, all strings whose similarity exceeds some threshold (for example, more than 90%) should end up in the same group. The desired output would be:

In [182]: output
Out[182]: 
       name             ID     group
0  stackoverflow       123       1
1  stikoverflow        322       1
2  stack, overflow     411       1
3  internet.com        531       2
4  internet            112       2
5  football            001       3

I searched through different topics and found this and this, which only do name matching and no clustering. This one shows only the best match, which doesn't help me. This page explains k-means clustering, but that requires the number of clusters to be set beforehand, which is not practical in this case.

UPDATE:

I figured out that the process module of the fuzzywuzzy package handles my problem to some extent. But its extract function only compares one string to a list, not a list to a list:

from fuzzywuzzy import process

# Read one name per line
with open("data-set.txt", "r") as f:
    data = f.read().splitlines()

# Top 3 matches for a single query string
process.extract("stackoverflow", data, limit=3)

Output:

[('stackoverflow', 100), ('stack, overflow', 93), ('stikoverflow', 88)]

But I still don't know how to use it for clustering.

Dio
  • This is *not* a clustering problem. It is more closely related to spelling correction. For an **unsupervised** approach, dog and fog are very close. Doggy and foggy are also close. But dog and doggy are much further apart. So don't use anything unsupervised! – Has QUIT--Anony-Mousse Jun 19 '18 at 06:38
  • I believe at some point we could consider it as a clustering problem since we have a similarity function and the similar strings are grouped together based on some threshold, correct? – Dio Jun 19 '18 at 12:44
  • The example I gave is a counterexample for this hypothesis. You don't have a good enough similarity function. Use something supervised. – Has QUIT--Anony-Mousse Jun 19 '18 at 17:37
  • I guess whatever function I use, I will have some false positives anyhow. I'd like to apply it to millions of records. An end user will evaluate it at the end. I just want to give them a group of similar records from which they select what they want. What similarity function do you recommend? – Dio Jun 19 '18 at 17:51
  • None. There is none that I know of that would work. People will likely suggest Levenshtein, but that one really won't work well. Just try it yourself. – Has QUIT--Anony-Mousse Jun 19 '18 at 19:02
  • So Fuzzywuzzy is based on Levenshtein, right? – Dio Jun 19 '18 at 19:06
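
The dog/fog counterexample from the comments can be checked with a plain edit distance. This is only a minimal sketch: `levenshtein` is a hypothetical helper written here for illustration, not a function from fuzzywuzzy.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


for a, b in [("dog", "fog"), ("doggy", "foggy"), ("dog", "doggy")]:
    print(a, b, levenshtein(a, b))
```

Under this measure the unrelated pair dog/fog is one edit apart, while the related pair dog/doggy is two edits apart, which is the concern raised above.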

1 Answer


This can be accomplished using string-grouper:

    from string_grouper import group_similar_strings
    group_similar_strings(data_set['name'])
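
For small inputs, the threshold grouping asked for in the question can also be sketched without extra packages. The sketch below is an illustration under stated assumptions, not the method string-grouper uses: `similarity` and `group_similar` are hypothetical names, the score comes from the standard library's `difflib.SequenceMatcher` as a stand-in for a fuzzywuzzy scorer, and the threshold of 75 is chosen only to suit the sample data. Strings are linked whenever a pair scores at or above the threshold, and each connected set of links becomes one group.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """0-100 similarity score (assumption: a stand-in for a fuzzywuzzy scorer)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def group_similar(names, threshold=75, scorer=similarity):
    """Label each name with a group number; names joined by a chain of
    pairs scoring >= threshold share a group (single-linkage)."""
    parent = list(range(len(names)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Compare every pair once; this is O(n^2) scorer calls.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if scorer(names[i], names[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    # Number the groups 1, 2, 3, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1)
            for i in range(len(names))]


names = ["stackoverflow", "stikoverflow", "stack, overflow",
         "internet.com", "internet", "football"]
print(group_similar(names))  # [1, 1, 1, 2, 2, 3]
```

The returned list can be assigned to a new column such as `data_set['group']`. Two caveats: the grouping is transitive, so a chain of near matches can pull fairly different strings into one group, and the pairwise loop will not scale to millions of records without some form of blocking or indexing.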
