See if this meets your needs
Title = ['Apartment at Boston', 'Apt at Boston', 'Apt at Chicago','Apt at Seattle','Apt at Seattle','Apt at Chicago']
Area = [100, 105, 100, 102,101,101]
Price = [150000, 149000,150200,150300,150000,150000]
data = dict(Title=Title, Area=Area, Price=Price)
df = pd.DataFrame(data, columns=data.keys())
The df created is as below
Title Area Price
0 Apartment at Boston 100 150000
1 Apt at Boston 105 149000
2 Apt at Chicago 100 150200
3 Apt at Seattle 102 150300
4 Apt at Seattle 101 150000
5 Apt at Chicago 101 150000
Now, we run the code as below
from fuzzywuzzy import fuzz
def fuzzy_compare(a,b):
val=fuzz.partial_ratio(a,b)
return val
tl = df["Title"].tolist()
itered=1
i=0
def do_the_thing(i):
itered=i+1
while itered < len(tl):
val=fuzzy_compare(tl[i],tl[itered])
if val > 80:
if abs((df.loc[i,'Area'])/(df.loc[itered,'Area']))>0.94 and abs((df.loc[i,'Area'])/(df.loc[itered,'Area']))<1.05:
if abs((df.loc[i,'Price'])/(df.loc[itered,'Price']))>0.94 and abs((df.loc[i,'Price'])/(df.loc[itered,'Price']))<1.05:
df.drop(itered,inplace=True)
df.reset_index()
pass
else:
pass
else:
pass
else:
pass
itered=itered+1
while i < len(tl)-1:
try:
do_the_thing(i)
i=i+1
except:
i=i+1
pass
else:
pass
the output is df as below. Repeating Boston & Seattle items are removed when fuzzy match is more that 80 & the values of Area & Price are within 5% of each other.
Title Area Price
0 Apartment at Boston 100 150000
2 Apt at Chicago 100 150200
3 Apt at Seattle 102 150300