I am trying to find the best way to group similar "Company Names" in my dataset, which has about 300k rows and 3 columns. I have tried several methods so far, including FuzzyWuzzy:
>>> from fuzzywuzzy import process
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
[('New York Jets', 100), ('New York Giants', 78)]
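This example matches a query against a separate list of choices. When I use df["Name"] as both the queries and the choices and apply the method above, the first match for every name is always the name itself with a score of 100, because the two lists are duplicates of each other.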
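The self-match problem can be reproduced with a minimal sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for the FuzzyWuzzy scorer, so the scores will not be identical to FuzzyWuzzy's. The helper names `score` and `best_match` and the `skip_index` parameter are made up for this illustration.

```python
from difflib import SequenceMatcher

def score(a, b):
    """Case-insensitive similarity on a 0-100 scale (stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def best_match(query, choices, skip_index=None):
    """Return (choice, score) for the best-scoring choice, optionally skipping one position."""
    candidates = [(c, score(query, c)) for i, c in enumerate(choices) if i != skip_index]
    return max(candidates, key=lambda pair: pair[1])

names = ["Google", "google.inc", "ddood"]

# Naive self-match: every name finds itself first, with a score of 100.
naive = [best_match(n, names) for n in names]

# Skipping the query's own position returns the closest *other* name instead.
others = [best_match(n, names, skip_index=i) for i, n in enumerate(names)]
```

With the own position skipped, "Google" is paired with "google.inc" rather than with itself, which is the comparison that actually matters for grouping.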
My exact code is:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

df = pd.DataFrame({"Name": ["Google", "google.inc", "ddood"]})
df2 = pd.DataFrame({"Name": ["google", "google"]})

get_match = []
for row in df.index:
    # DataFrame.get_value was removed in pandas 1.0; .at is the replacement
    name1 = [df.at[row, "Name"]]
    for columns in df2.index:
        name2 = [df2.at[columns, "Name"]]
        # score the name from df against the single candidate from df2
        matched_token = [process.extract(x, name2, limit=2)[0][1] for x in name1]
        get_match.append([matched_token, name1[0], name2[0]])

df_maneet = pd.DataFrame({
    'name1': [i[1] for i in get_match],
    'name2': [i[2] for i in get_match],
    'Ratio': [i[0][0] for i in get_match],
})
# keep only the pairs that are near-identical
new_df = df_maneet[df_maneet.Ratio > 95]
I doubt that the above is the best way to approach my problem. My end result should be groups of similar company names, with every variant of the same company in one group.
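To make the desired result concrete, here is a rough sketch of what such grouping could look like. It is an illustration under assumptions, not a recommended solution: `difflib.SequenceMatcher` stands in for the FuzzyWuzzy scorer, the threshold of 70 is arbitrary, and `similarity` and `group_names` are hypothetical helper names. Each name is greedily assigned to the first group whose first member is similar enough.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity on a 0-100 scale (stand-in for a fuzzywuzzy scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def group_names(names, threshold=70):
    """Greedily put each name into the first group whose representative is similar enough."""
    groups = []  # each group is a list; its first element acts as the representative
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            # no existing group matched, so this name starts a new one
            groups.append([name])
    return groups

groups = group_names(["Google", "google.inc", "ddood", "google"])
```

Here the three Google variants end up in one group and "ddood" in a group of its own. Comparing every name against every group representative grows quickly with the number of groups, so this sketch would likely be too slow as written for 300k rows.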
The answer to this related question did not help either: finding-similar-contact-names-within-table