Finding the best match for similar texts and keeping only unique values

Asked Jul 20 '19 at 03:21

Active Jul 20 '19 at 03:25

Viewed 94 times

I have a list of project names that I have tried to clean up, but they still contain duplicates with minor mismatches. I want to find each name's nearest match and replace all occurrences with that match.

I am using Python and Pandas and have imported a file that has a column in which the project names are embedded. I did some cleaning and removed extra characters to extract the project names, but some names occur with minor mismatches. I used difflib to find the closest match, but it returns two values and the best match is the name itself.

      Project Name  
552   Hilton International
553   Hilton International A

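import difflib

# Take the name at position 552 as the lookup key
key = df2.iloc[552]['Project Name']
# The key is itself in the candidate list, so it comes back as the best match
difflib.get_close_matches(key, df2['Project Name'].tolist())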

expected result:

      Project Name  
552   Hilton International
553   Hilton International

edited Jul 20 '19 at 03:25

inspectorG4dget

asked Jul 20 '19 at 03:21

AdnanTC

My suggestion is that you do something like what I describe here: https://stackoverflow.com/a/20354639/56778 – Jim Mischel Jul 20 '19 at 04:37

See also http://blog.mischel.com/2014/10/20/solving-the-right-problem/ – Jim Mischel Jul 20 '19 at 04:44

What is your expectation? The result is correct. You can change the key a little, e.g. to "Hilton Internationalll", and your code will still find the match. – Cao Minh Vu Jul 20 '19 at 05:17
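
A minimal sketch of one way to get the expected result from the question, under a few assumptions that are not in the original post: the first spelling seen is treated as the preferred one, a similarity cutoff of 0.8 is strict enough for this data, and "Marriott Downtown" is a made-up name added only to show that unrelated names are left alone. The helper name `canonicalize` is hypothetical. Because each name is compared only against names already accepted as canonical, it can never match itself, which sidesteps the behaviour described in the question.

```python
import difflib


def canonicalize(names, cutoff=0.8):
    """Map each name to the first previously seen name it closely matches.

    Hypothetical helper: assumes the first spelling encountered is the
    preferred one and that `cutoff` suits the data.
    """
    canonical = []  # accepted representative names, in order of appearance
    mapping = {}    # original name -> representative name
    for name in names:
        if name in mapping:
            continue
        # Compare only against accepted names, so a name never matches itself.
        match = difflib.get_close_matches(name, canonical, n=1, cutoff=cutoff)
        if match:
            mapping[name] = match[0]
        else:
            canonical.append(name)
            mapping[name] = name
    return mapping


# Illustrative data; "Marriott Downtown" is invented for this example.
names = ["Hilton International", "Hilton International A", "Marriott Downtown"]
mapping = canonicalize(names)
cleaned = [mapping[name] for name in names]
print(cleaned)
# ['Hilton International', 'Hilton International', 'Marriott Downtown']
```

With the DataFrame from the question, the same mapping could be applied with `df2['Project Name'] = df2['Project Name'].map(canonicalize(df2['Project Name']))`. If the first occurrence is not always the cleanest spelling, a different rule for picking the representative (for example, the shortest name in each group) may work better.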

0 Answers