
I'm wondering if there's a better way to compare all items in a dataframe column to each other and replace those that have a high fuzzy matching score. I ended up using itertools.combinations, but my feeling is that this is memory intensive and inefficient. My code is below.

To clarify: the central question here is not the fuzzy matching aspect, but the aspect of comparing all items in a list to each other and then replacing those items that match.
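
A quick illustration of the memory point, as an aside: `itertools.combinations` is already a lazy iterator, and it's the `list(...)` wrapper that materializes all n*(n-1)/2 pairs in memory at once. Iterating over it directly keeps memory usage flat (the names below are made up for illustration):

    from itertools import combinations

    names = ['New York University', 'York University', 'Yale']

    # combinations() yields pairs lazily; no list of all pairs is ever built
    for a, b in combinations(names, 2):
        print(a, '<->', b)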

import pandas as pd
from itertools import combinations
from fuzzywuzzy import fuzz

newl = list(true_df2.Name.unique())

def remove_duplicate_names(newl, Name, origdf, namesave):
    """
    Remove duplicate names by replacing longer names with their shorter matches.
    Takes in (1) newl: a list of unique names, where generic words have already been stripped out;
    (2) Name: the name of the dataframe column;
    (3) origdf: the original dataframe that is being rewritten;
    (4) namesave: the name of the saved matched-words file, e.g. 'save1'. I created (4) because
    this function takes a long time to run.
    Returns a dataframe.
    """
    if isinstance(newl, pd.DataFrame):
        newl = list(newl[Name].unique())

    matchword = []  # defined up front so the loop below never sees an unbound name
    if isinstance(newl, list):
        cnl = combinations(newl, 2)  # lazy iterator; no need to wrap in list()
        for a, b in cnl:
            # Skip very short names and identical pairs before scoring.
            if len(a) > 3 and len(b) > 3 and a != b:
                #if a == 'York University' or b == 'York University':
                #    continue
                # These conditions can be edited to make matches more or less strict:
                # higher thresholds mean stricter matching, and combining more
                # criteria with 'and' means stricter matching.
                if fuzz.partial_ratio(a, b) >= 98:
                    shortstr = min((a, b), key=len)
                    longstr = max((a, b), key=len)
                    matchword.append((shortstr, longstr))
    
    for pair in matchword:
        # Replace each occurrence of the longer string with the shorter one.
        print('pair', pair)
        print(origdf[Name][origdf[Name].str.contains(pair[1], regex=False)])
        origdf.loc[origdf[Name].str.contains(pair[1], regex=False), Name] = pair[0]
    return origdf
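
For reference, a small usage sketch; the toy dataframe below is made up for illustration and is not from the original data:

    import pandas as pd

    true_df2 = pd.DataFrame({'Name': ['New York University',
                                      'New York University in the City',
                                      'Yale', 'Yale University']})
    newl = list(true_df2.Name.unique())
    cleaned = remove_duplicate_names(newl, 'Name', true_df2, 'save1')
    print(cleaned.Name.unique())  # longer variants collapsed to the shorter names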
  • possible duplicate of [Python/Pandas: How to Normalize Names](http://stackoverflow.com/questions/19437631/python-pandas-how-to-normalize-names) – Andy Hayden Mar 24 '14 at 23:35
  • Or are you trying to do some kind of clustering? – Andy Hayden Mar 24 '14 at 23:36
  • Sorry, I may have mislabeled the title of my question. It's more centered on how I might be able to efficiently run through all items in a list (comparing each item to every other) and replace them. – user3314418 Mar 24 '14 at 23:52
  • So clustering...? If you have 5 words which are close to each other, how do you pick which is the true name (i.e. the thing to replace them with)? – Andy Hayden Mar 25 '14 at 00:16
  • Yeah, I'm using clustering (I'm using the fuzzywuzzy library) – user3314418 Mar 25 '14 at 00:20
  • Still not following how the cluster name is chosen. Also, you don't need to apply list in the definition of cnl since you're iterating through it. – Andy Hayden Mar 25 '14 at 00:27
  • Basically, I choose the name that has the shortest length, and I replace only if the fuzzy partial ratio is 100. – user3314418 Mar 25 '14 at 01:32
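
Building on the rule described in the comments (shortest name wins, high partial-ratio threshold), one alternative sketch avoids the per-pair str.contains scan by first collecting a dict mapping each longer name to its shortest match, then applying it in a single vectorized pass. Note this replaces exact values rather than substrings, so it behaves slightly differently from the str.contains version above; the threshold of 98 is taken from the posted code:

    from itertools import combinations
    from fuzzywuzzy import fuzz

    def build_canonical_map(names, threshold=98):
        """Map each longer name to the shortest name it fuzzily matches."""
        canonical = {}
        for a, b in combinations(names, 2):
            if len(a) > 3 and len(b) > 3 and a != b:
                if fuzz.partial_ratio(a, b) >= threshold:
                    short = min((a, b), key=len)
                    long_ = max((a, b), key=len)
                    # Keep the shortest replacement seen so far for each long name.
                    if long_ not in canonical or len(short) < len(canonical[long_]):
                        canonical[long_] = short
        return canonical

    # Single vectorized pass (exact matches only):
    # origdf['Name'] = origdf['Name'].replace(build_canonical_map(newl))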

0 Answers