I have two Pandas DataFrames of different lengths. DF1 has about 1.2 million rows (and just 1 column), DF2 has about 300,000 rows (and a single column), and I am trying to find similar items across the two lists.
DF1 is about 75% company names and 25% people's names, and the reverse is true for DF2, but both are alphanumeric. I would like to write a function that finds the most similar items between the two lists, ranked by a score (or percentage). For example,
Apple -> Apple Inc. (0.95)
Apple -> Applebees (0.68)
Banana Boat -> Banana Bread (0.25)
So far, I have tried two approaches, both of which have failed.
Method 1: Find Jaccard Coefficients for the two lists.
import numpy as np
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(df_1, df_2)
This does not work, probably because the two DataFrames have different lengths, and I get this error:
ValueError: Found arrays with inconsistent numbers of samples
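As far as I can tell, jaccard_similarity_score is a classification metric that compares two equal-length arrays of labels element by element, so it may not be the right tool for strings anyway. What I had in mind is closer to a set-based Jaccard coefficient computed per pair of strings. A minimal sketch, assuming character bigrams and case-insensitive comparison (the helper names char_ngrams and jaccard are my own, not from any library):

```python
def char_ngrams(text, n=2):
    """Return the set of lowercase character n-grams in text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b, n=2):
    """Jaccard coefficient |A & B| / |A | B| over character n-gram sets."""
    set_a, set_b = char_ngrams(a, n), char_ngrams(b, n)
    if not set_a and not set_b:
        # Both strings are shorter than n; treat them as identical.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


print(jaccard("Apple", "Apple Inc."))
print(jaccard("Apple", "Applebees"))
```

This works on a single pair of strings, but I do not know how to apply it sensibly across two columns of this size.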
Method 2: Using SequenceMatcher
from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
And then calling it on the DataFrames:
similar(df_1, df_2)
This results in an error:
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)()
pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)()
KeyError: 0
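I suspect the KeyError comes from SequenceMatcher expecting two sequences such as strings and indexing into them by position, whereas indexing a DataFrame with 0 looks up a column labelled 0. On individual strings the function does behave as expected. Here is a rough sketch of what I am after on a tiny sample, using plain lists in place of the DataFrame columns (top_matches, names_1 and names_2 are names I made up for illustration):

```python
from difflib import SequenceMatcher


def similar(a, b):
    # ratio() returns a float in [0, 1]; 1.0 means identical sequences.
    return SequenceMatcher(None, a, b).ratio()


def top_matches(name, candidates, n=3):
    """Score name against every candidate and return the n best (candidate, score) pairs."""
    scored = [(cand, similar(name, cand)) for cand in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Tiny stand-ins for the two columns, e.g. df_1.iloc[:, 0].tolist()
names_1 = ["Apple", "Banana Boat"]
names_2 = ["Apple Inc.", "Applebees", "Banana Bread"]

for name in names_1:
    for cand, score in top_matches(name, names_2):
        print(f"{name} -> {cand} ({score:.2f})")
```

The scores from this sketch will not match the illustrative numbers above, and comparing every pair this way would mean roughly 1,200,000 × 300,000 ≈ 3.6 × 10^11 comparisons, so brute force does not look feasible at full scale.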
How could I approach this problem?