I'm wondering if there's a better way to compare all items in a dataframe column to each other and replace those items when they have a high fuzzy-matching score. I ended up using combinations, but my feeling is that this is memory-intensive and inefficient. My code is below.
To clarify: the central question here is not the fuzzy matching itself, but the approach of comparing all items in a list to each other and then replacing the items that match.
from itertools import combinations

import pandas as pd
from fuzzywuzzy import fuzz


newl = list(true_df2.Name.unique())

def remove_duplicate_names(newl, Name, origdf, namesave):
    """
    This function removes duplicate names. It replaces longer names with shorter names.
    It takes in:
    (1) newl: a list of unique names, where generic words have already been stripped out.
    (2) Name: name of the dataframe column
    (3) origdf: the original dataframe that is being rewritten
    (4) namesave: name of the saved matchedwords file, e.g. 'save1'. I created (4) because
        this function takes a long time to run.
    Returns a dataframe.
    """
    if isinstance(newl, pd.DataFrame):
        newl = list(newl[Name].unique())
    if isinstance(newl, list):
        cnl = list(combinations(newl, 2))
    matchword = []
    for i in cnl:
        fp = fuzz.partial_ratio(i[0], i[1])
        if len(i[0]) > 3 and len(i[1]) > 3:
            if not i[0] == i[1]:
                #if i[0] or i[1] == 'York University':
                #    continue
                # I can edit these conditions to make matches more or less strict:
                # higher values mean more strict, and using more criteria
                # with 'and' means more strict.
                if fp >= 98:
                    shortstr = min(i, key=len)
                    longstr = max(i, key=len)
                    matchword.append((shortstr, longstr))
    for pair in matchword:
        # replace, in each spot where the longer string occurs, the shorter string
        print 'pair', pair
        print origdf[Name][origdf[Name].str.contains(pair[1])]
        #origdf[Name][origdf[Name].str.contains(pair[1])] = pair[0].strip()
        origdf.ix[origdf[Name].str.contains(pair[1]), 'Name'] = pair[0]
    return origdf
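
For reference, here is a minimal sketch of how the function might be exercised on a toy dataframe. The sample names and the 'save1' argument are made up for illustration; it assumes pandas and fuzzywuzzy are installed and the definitions above are in scope.

df = pd.DataFrame({'Name': ['York University', 'York Univ', 'Tech Institute']})
unique_names = list(df.Name.unique())
# 'York Univ' and 'York University' score a partial ratio of 100,
# so the longer name gets replaced by the shorter one in df.
cleaned = remove_duplicate_names(unique_names, 'Name', df, 'save1')
print cleaned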