I want to eliminate repeated rows in my dataframe.

I know that the drop_duplicates() method works for dropping rows with identical column values. However, I want to drop rows that aren't identical but merely similar. For example, I have the following two rows:

        Title        |  Area  |  Price
Apartment at Boston  |  100   |  150000
Apt at Boston        |  105   |  149000

I want to be able to eliminate these two rows based on some similarity measure, for example if Title, Area, and Price each differ by less than 5%. Say, I could delete one row of any pair whose similarity measure is > 0.95. This would be particularly useful for large data sets, where inspecting row by row is impractical. How can I achieve this?

  • It is hard to compare an abbreviation with the whole word; even Soundex will return different codes – BENY Aug 01 '19 at 00:56
  • Yeah, I was thinking maybe counting identical words and taking the ratio of identical words to total words, and similarly taking the percentage difference between the Area and Price values. I am not sure how to implement this, though (a rough sketch follows these comments). – Seller Central Aug 01 '19 at 01:16
  • Check if fuzzywuzzy will help. https://pypi.org/project/fuzzywuzzy/ – moys Aug 01 '19 at 02:53
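
A minimal sketch of the word-overlap idea from the comments above, assuming whitespace-separated titles; the word_overlap and pct_close helpers are illustrative names, not from any library:

# Word overlap for titles, percentage closeness for numeric columns.
def word_overlap(a, b):
    # ratio of shared words to total distinct words (Jaccard similarity)
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def pct_close(x, y, tol=0.05):
    # True when x and y differ by less than tol, relative to the larger value
    return abs(x - y) / max(abs(x), abs(y)) < tol

print(word_overlap('Apartment at Boston', 'Apt at Boston'))  # 0.5 (2 shared words out of 4)
print(pct_close(150000, 149000))  # True (about 0.7% apart)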

2 Answers

Here is a function using difflib. I got the similar() function from here. You may also want to check out some of the other answers on that page to determine the best similarity metric for your use case.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Title': ['Apartment at Boston', 'Apt at Boston'],
                    'Area': [100, 105],
                    'Price': [150000, 149000]})

from difflib import SequenceMatcher

def string_ratio(df, col, ratio):
    # flags rows whose string in `col` is at least `ratio` similar
    # to the string in every other row
    def similar(a, b):
        return SequenceMatcher(None, a, b).ratio()
    ratios = []
    for i, x in enumerate(df[col]):
        # similarity of row i's string against the whole column
        a = np.array([similar(x, row) for row in df[col]])
        # indices of rows that fall below the threshold
        a = np.where(a < ratio)[0]
        # True if no row other than i itself is below the threshold
        ratios.append(len(a[a != i]) == 0)
    return pd.Series(ratios)

def numeric_ratio(df, col, ratio):
    # same idea for numeric columns: min/max of each pair gives a
    # closeness measure between 0 and 1
    ratios = []
    for i, x in enumerate(df[col]):
        a = np.array([min(x, row) / max(x, row) for row in df[col]])
        a = np.where(a < ratio)[0]
        ratios.append(len(a[a != i]) == 0)
    return pd.Series(ratios)

# a row is flagged only when it is similar to every other row in all
# three columns; the mask then removes the flagged rows
mask = ~(string_ratio(df1, 'Title', .95)
         & numeric_ratio(df1, 'Area', .95)
         & numeric_ratio(df1, 'Price', .95))

df1[mask]

It should be able to weed out most of the similar data, though you might want to tweak the string_ratio function if it doesn't suit your case.
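
As a quick check of the thresholds (my own sketch, reusing string_ratio and numeric_ratio from above): SequenceMatcher scores the two sample titles at roughly 0.81, so with the 0.95 title threshold both rows survive; loosening it to 0.80 flags them, though note that this approach then drops every row of a similar group rather than keeping the first.

from difflib import SequenceMatcher

print(SequenceMatcher(None, 'Apartment at Boston', 'Apt at Boston').ratio())  # roughly 0.81

# looser title threshold: both rows now count as mutually similar
mask = ~(string_ratio(df1, 'Title', .80)
         & numeric_ratio(df1, 'Area', .95)
         & numeric_ratio(df1, 'Price', .95))
print(df1[mask])  # empty DataFrame: both similar rows are dropped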

braintho

See if this meets your needs:

import pandas as pd

Title = ['Apartment at Boston', 'Apt at Boston', 'Apt at Chicago',
         'Apt at Seattle', 'Apt at Seattle', 'Apt at Chicago']
Area = [100, 105, 100, 102, 101, 101]
Price = [150000, 149000, 150200, 150300, 150000, 150000]
data = dict(Title=Title, Area=Area, Price=Price)
df = pd.DataFrame(data, columns=data.keys())

The df created is as below

                 Title  Area   Price
0  Apartment at Boston   100  150000
1        Apt at Boston   105  149000
2       Apt at Chicago   100  150200
3       Apt at Seattle   102  150300
4       Apt at Seattle   101  150000
5       Apt at Chicago   101  150000

Now, we run the code as below

from fuzzywuzzy import fuzz

def fuzzy_compare(a, b):
    # partial_ratio scores the best-matching substring, so an abbreviated
    # title like 'Apt at Boston' can still match 'Apartment at Boston'
    return fuzz.partial_ratio(a, b)

tl = df["Title"].tolist()

def do_the_thing(i):
    # compare row i against every later row; drop the later row when the
    # titles fuzzy-match and Area and Price are both within about 5%
    itered = i + 1
    while itered < len(tl):
        val = fuzzy_compare(tl[i], tl[itered])
        if val > 80:
            area_ratio = df.loc[i, 'Area'] / df.loc[itered, 'Area']
            price_ratio = df.loc[i, 'Price'] / df.loc[itered, 'Price']
            if 0.94 < area_ratio < 1.05 and 0.94 < price_ratio < 1.05:
                # keep the original index labels so df.loc lookups stay valid
                df.drop(itered, inplace=True)
        itered = itered + 1

i = 0
while i < len(tl) - 1:
    try:
        do_the_thing(i)
    except KeyError:
        # the row was dropped in an earlier pass; skip and move on
        pass
    i = i + 1

The output df is as below. The repeating Boston and Seattle rows are removed when the fuzzy match is more than 80 and the Area and Price values are within 5% of each other.

                 Title  Area   Price
0  Apartment at Boston   100  150000
2       Apt at Chicago   100  150200
3       Apt at Seattle   102  150300
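
As a quick sanity check of the 80 cutoff (my own sketch, not part of the answer): fuzz.partial_ratio aligns the shorter title against the best-matching window of the longer one, which is why the abbreviated title still scores highly:

from fuzzywuzzy import fuzz

# the shared 't at Boston' suffix dominates the best-matching window
print(fuzz.partial_ratio('Apartment at Boston', 'Apt at Boston'))  # roughly 85, above the cutoff
print(fuzz.partial_ratio('Apt at Boston', 'Apt at Chicago'))       # roughly 54, below the cutoff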
moys
  • Thanks for the reply! How would you alter the code to handle more than two repeated rows when we just want to keep one of them? – Seller Central Aug 02 '19 at 03:48
  • This already does that. It will retain only the first instance, irrespective of how many more similar rows there are. I checked with three 'Chicago' rows and only the first one remains. Please check. – moys Aug 02 '19 at 05:39
  • I have updated the code a bit more, to catch an exception. Try this. – moys Aug 02 '19 at 06:01