
Suppose I have a dataset in a Pandas DataFrame:

Sr.No | query
------|---------
1     | tiger
2     | tigers
3     | lion
4     | lionx
5     | ilion
6     | 56tigers

The resulting dataset should contain:

Sr.No | query
------|------
1     | tiger
2     | tiger
3     | lion
4     | lion
5     | lion
6     | tiger

I have no idea how to do this, so links or book titles with code would be preferred. I know it is a broad topic and may involve NLTK and clustering algorithms like kNN, but any kind of help will be appreciated.
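For reference, one way to get the desired output is to fuzzy-match each query against a list of canonical terms with the standard-library `difflib` module. This is only a minimal sketch: the `canonical` list here is an assumption (in practice it could come from a dictionary or the most frequent queries in the data), and the 0.6 cutoff is arbitrary.

```python
import difflib

import pandas as pd

df = pd.DataFrame({"query": ["tiger", "tigers", "lion", "lionx", "ilion", "56tigers"]})

# Hypothetical list of known-good terms; not part of the original question.
canonical = ["tiger", "lion"]

def normalize(q, cutoff=0.6):
    # get_close_matches returns candidates ordered by similarity ratio;
    # fall back to the original query when nothing clears the cutoff.
    matches = difflib.get_close_matches(q, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else q

df["query"] = df["query"].apply(normalize)
print(df["query"].tolist())  # ['tiger', 'tiger', 'lion', 'lion', 'lion', 'tiger']
```

With these toy inputs every variant ("tigers", "lionx", "ilion", "56tigers") clears the 0.6 ratio against its canonical form, so the whole column collapses to "tiger"/"lion".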

Nickil Maveli
neha
  • You might want to just compare edit distance between the words and cluster them based on that. You can also throw in dictionary words (e.g. tiger, tigers, tiggers all map to tiger from the dictionary). – ffledgling Aug 03 '16 at 12:46
  • Someone asked something similar here: http://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas – Jan Zeiseweis Aug 03 '16 at 13:02
  • I am using the difflib package as below. If the similarity score is more than 0.60, it compares the lengths and replaces the longer value with the shorter one, because in this dataset the extra characters mostly need to be removed. But as you can see, the loop is taking a lot of time, though it is working:

        import difflib
        for i in range(149, 42365):
            for j in range(i + 1, 42365):
                if difflib.SequenceMatcher(None, temp['query'][i], temp['query'][j]).ratio() > 0.60:
                    if len(temp['query'][i]) < len(temp['query'][j]):
                        temp['query'][j] = temp['query'][i]

    – neha Aug 04 '16 at 06:39
  • Can you suggest something here? – neha Aug 04 '16 at 06:42
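The pairwise loop in the comment above is O(n²) over all rows. A cheaper sketch of the same idea is to process queries shortest-first and compare each one only against a growing list of representatives, so most comparisons are skipped once a cluster exists. This is a hypothetical rework, not the asker's code; the 0.60 threshold is carried over from the comment.

```python
import difflib

import pandas as pd

temp = pd.DataFrame({"query": ["tiger", "tigers", "lion", "lionx", "ilion", "56tigers"]})

representatives = []

def collapse(q, threshold=0.60):
    # Reuse an earlier-seen (and, by processing order, shorter) form
    # when the similarity ratio clears the threshold.
    for rep in representatives:
        if difflib.SequenceMatcher(None, rep, q).ratio() > threshold:
            return rep
    representatives.append(q)
    return q

# Shortest queries first, so the shortest variant becomes the representative.
mapping = {q: collapse(q) for q in sorted(temp["query"], key=len)}
temp["query"] = temp["query"].map(mapping)
print(temp["query"].tolist())  # ['tiger', 'tiger', 'lion', 'lion', 'lion', 'tiger']
```

Because each query is compared against representatives rather than every other row, the cost is closer to O(n·k) for k clusters, which matters at the ~42,000 rows mentioned in the comment.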
