
I'm wondering if there's a better way to compare all items in a dataframe column to each other and replace those that have a high fuzzy matching score. I ended up using itertools.combinations, but my feeling is that this is memory intensive and inefficient. My code is below.

To clarify: the central question here is not the fuzzy matching aspect, but the aspect of comparing all items in a list to each other and then replacing those items that match.
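
A quick illustration of the memory point, as an aside: `itertools.combinations` is already a lazy iterator, and it's the `list(...)` wrapper that materializes all n*(n-1)/2 pairs in memory at once. Iterating over it directly keeps memory usage flat (the names below are made up for illustration):

    from itertools import combinations

    names = ['New York University', 'York University', 'Yale']

    # combinations() yields pairs lazily; no list of all pairs is ever built
    for a, b in combinations(names, 2):
        print(a, '<->', b)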

import pandas as pd
from itertools import combinations
from fuzzywuzzy import fuzz

newl = list(true_df2.Name.unique())

def remove_duplicate_names(newl, Name, origdf, namesave):
    """
    Remove duplicate names by replacing longer names with their shorter matches.
    Takes in (1) newl: a list of unique names, where generic words have already been stripped out;
    (2) Name: the name of the dataframe column;
    (3) origdf: the original dataframe that is being rewritten;
    (4) namesave: the name of the saved matched-words file, e.g. 'save1'. I created (4) because
    this function takes a long time to run.
    Returns a dataframe.
    """
    if isinstance(newl, pd.DataFrame):
        newl = list(newl[Name].unique())

    matchword = []  # defined up front so the loop below never sees an unbound name
    if isinstance(newl, list):
        cnl = combinations(newl, 2)  # lazy iterator; no need to wrap in list()
        for a, b in cnl:
            # Skip very short names and identical pairs before scoring.
            if len(a) > 3 and len(b) > 3 and a != b:
                #if a == 'York University' or b == 'York University':
                #    continue
                # These conditions can be edited to make matches more or less strict:
                # higher thresholds mean stricter matching, and combining more
                # criteria with 'and' means stricter matching.
                if fuzz.partial_ratio(a, b) >= 98:
                    shortstr = min((a, b), key=len)
                    longstr = max((a, b), key=len)
                    matchword.append((shortstr, longstr))
    
    for pair in matchword:
        # Replace each occurrence of the longer string with the shorter one.
        print('pair', pair)
        print(origdf[Name][origdf[Name].str.contains(pair[1], regex=False)])
        origdf.loc[origdf[Name].str.contains(pair[1], regex=False), Name] = pair[0]
    return origdf
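
For reference, a small usage sketch; the toy dataframe below is made up for illustration and is not from the original data:

    import pandas as pd

    true_df2 = pd.DataFrame({'Name': ['New York University',
                                      'New York University in the City',
                                      'Yale', 'Yale University']})
    newl = list(true_df2.Name.unique())
    cleaned = remove_duplicate_names(newl, 'Name', true_df2, 'save1')
    print(cleaned.Name.unique())  # longer variants collapsed to the shorter names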
  • possible duplicate of [Python/Pandas: How to Normalize Names](http://stackoverflow.com/questions/19437631/python-pandas-how-to-normalize-names) – Andy Hayden Mar 24 '14 at 23:35
  • Or are you trying to do some kind of clustering? – Andy Hayden Mar 24 '14 at 23:36
  • Sorry, I may have mislabeled the title of my question. It's more centered on how I might be able to efficiently run through all items in a list (comparing each item to every other) and replace them. – user3314418 Mar 24 '14 at 23:52
  • So clustering...? If you have 5 words which are close to each other, how do you pick which is the true name (i.e. the thing to replace them with)? – Andy Hayden Mar 25 '14 at 00:16
  • Yeah, I'm using clustering (I'm using the fuzzywuzzy library) – user3314418 Mar 25 '14 at 00:20
  • Still not following how the cluster name is chosen. Also, you don't need to apply list in the definition of cnl since you're iterating through it. – Andy Hayden Mar 25 '14 at 00:27
  • Basically, I choose the name that has the shortest length, and I replace only if the fuzzy partial ratio is 100. – user3314418 Mar 25 '14 at 01:32
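
Building on the rule described in the comments (shortest name wins, high partial-ratio threshold), one alternative sketch avoids the per-pair str.contains scan by first collecting a dict mapping each longer name to its shortest match, then applying it in a single vectorized pass. Note this replaces exact values rather than substrings, so it behaves slightly differently from the str.contains version above; the threshold of 98 is taken from the posted code:

    from itertools import combinations
    from fuzzywuzzy import fuzz

    def build_canonical_map(names, threshold=98):
        """Map each longer name to the shortest name it fuzzily matches."""
        canonical = {}
        for a, b in combinations(names, 2):
            if len(a) > 3 and len(b) > 3 and a != b:
                if fuzz.partial_ratio(a, b) >= threshold:
                    short = min((a, b), key=len)
                    long_ = max((a, b), key=len)
                    # Keep the shortest replacement seen so far for each long name.
                    if long_ not in canonical or len(short) < len(canonical[long_]):
                        canonical[long_] = short
        return canonical

    # Single vectorized pass (exact matches only):
    # origdf['Name'] = origdf['Name'].replace(build_canonical_map(newl))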

0 Answers