
Suppose I have a dataset in a Pandas DataFrame:

Sr.No | query
------|---------
1     | tiger
2     | tigers
3     | lion
4     | lionx
5     | ilion
6     | 56tigers

The resulting dataset should contain:

Sr.No | query
------|------
1     | tiger
2     | tiger
3     | lion
4     | lion
5     | lion
6     | tiger

I have no idea how to do this, so links or book titles with code would be preferred. I know it is a broad topic and may involve NLTK and clustering algorithms like kNN, but any kind of help will be appreciated.
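For reference, one way to get the desired output is to fuzzy-match each query against a list of canonical terms with the standard-library `difflib` module. This is only a minimal sketch: the `canonical` list here is an assumption (in practice it could come from a dictionary or the most frequent queries in the data), and the 0.6 cutoff is arbitrary.

```python
import difflib

import pandas as pd

df = pd.DataFrame({"query": ["tiger", "tigers", "lion", "lionx", "ilion", "56tigers"]})

# Hypothetical list of known-good terms; not part of the original question.
canonical = ["tiger", "lion"]

def normalize(q, cutoff=0.6):
    # get_close_matches returns candidates ordered by similarity ratio;
    # fall back to the original query when nothing clears the cutoff.
    matches = difflib.get_close_matches(q, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else q

df["query"] = df["query"].apply(normalize)
print(df["query"].tolist())  # ['tiger', 'tiger', 'lion', 'lion', 'lion', 'tiger']
```

With these toy inputs every variant ("tigers", "lionx", "ilion", "56tigers") clears the 0.6 ratio against its canonical form, so the whole column collapses to "tiger"/"lion".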

Nickil Maveli
neha
  • You might want to just compare edit distance between the words and cluster them based on that. You can also throw in dictionary words (e.g. tiger, tigers, tiggers all map to tiger from the dictionary). – ffledgling Aug 03 '16 at 12:46
  • Someone asked something similar here: http://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas – Jan Zeiseweis Aug 03 '16 at 13:02
  • I am using the difflib package as below. If the similarity score is more than 0.60, it compares the lengths and replaces the longer value with the shorter one, because in this dataset the extra characters mostly need to be removed. But as you can see, the loop is taking a lot of time, though it is working:

        import difflib
        for i in range(149, 42365):
            for j in range(i + 1, 42365):
                if difflib.SequenceMatcher(None, temp['query'][i], temp['query'][j]).ratio() > 0.60:
                    if len(temp['query'][i]) < len(temp['query'][j]):
                        temp['query'][j] = temp['query'][i]

    – neha Aug 04 '16 at 06:39
  • Can you suggest something here? – neha Aug 04 '16 at 06:42
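The pairwise loop in the comment above is O(n²) over all rows. A cheaper sketch of the same idea is to process queries shortest-first and compare each one only against a growing list of representatives, so most comparisons are skipped once a cluster exists. This is a hypothetical rework, not the asker's code; the 0.60 threshold is carried over from the comment.

```python
import difflib

import pandas as pd

temp = pd.DataFrame({"query": ["tiger", "tigers", "lion", "lionx", "ilion", "56tigers"]})

representatives = []

def collapse(q, threshold=0.60):
    # Reuse an earlier-seen (and, by processing order, shorter) form
    # when the similarity ratio clears the threshold.
    for rep in representatives:
        if difflib.SequenceMatcher(None, rep, q).ratio() > threshold:
            return rep
    representatives.append(q)
    return q

# Shortest queries first, so the shortest variant becomes the representative.
mapping = {q: collapse(q) for q in sorted(temp["query"], key=len)}
temp["query"] = temp["query"].map(mapping)
print(temp["query"].tolist())  # ['tiger', 'tiger', 'lion', 'lion', 'lion', 'tiger']
```

Because each query is compared against representatives rather than every other row, the cost is closer to O(n·k) for k clusters, which matters at the ~42,000 rows mentioned in the comment.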
