I have a list of project names that I have tried to clean up, but it still contains near-duplicates that differ only slightly. I want to find the nearest match for each name and replace all occurrences with that match.
I am using Python and pandas and have imported a file with a column in which the project names are embedded. I did some cleaning and removed extra characters to extract the project names, but some names still occur with minor mismatches. I used difflib to find the closest match, but it returns two values and the best match is the name itself.
Project Name
552 Hilton International
553 Hilton International A
import difflib

# Take the name at position 552 and look up its closest matches in the whole column
key = df2['Project Name'].iloc[552]
difflib.get_close_matches(key, df2['Project Name'].tolist())
Expected result:
Project Name
552 Hilton International
553 Hilton International
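For illustration, here is a minimal sketch of one way the expected result might be reached with difflib alone. It is not a definitive solution. The sample frame, the helper name `canonicalize`, and the cutoff of 0.9 are all assumptions made for this example. The idea is to compare each name only against names already accepted as canonical, so a name can never come back as its own best match.

```python
import difflib
import pandas as pd

# Hypothetical sample data standing in for df2; "Grand Plaza" is invented for the example
df2 = pd.DataFrame(
    {"Project Name": ["Hilton International", "Hilton International A", "Grand Plaza"]},
    index=[552, 553, 554],
)

def canonicalize(names, cutoff=0.9):
    """Map each name to the first previously seen name it closely matches (hypothetical helper)."""
    canonical = []  # names accepted as canonical so far
    mapping = {}
    for name in names:
        # Search only among already accepted names, so the name itself is never a candidate
        match = difflib.get_close_matches(name, canonical, n=1, cutoff=cutoff)
        if match:
            mapping[name] = match[0]
        else:
            canonical.append(name)
            mapping[name] = name
    return mapping

# Build the mapping from the distinct names, then replace every occurrence
mapping = canonicalize(df2["Project Name"].unique())
df2["Project Name"] = df2["Project Name"].map(mapping)
print(df2)
```

The result depends on order: whichever variant appears first becomes the canonical one, so if the longer spelling came first it would win. The cutoff is also a judgment call. A lower value merges more aggressively and may join names that are genuinely different projects.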