1

I have two Pandas Dataframes, both of varying length. DF1 has about 1.2 millions row (and just 1 column), DF2 has about 300,000 rows (and a single column), and I am trying to find similar items from both lists.

DF1 has about 75% Company Names, and 25% People, and the reverse is true for DF2, but they are both alphanumeric. What I would like is to write a function that will highlight the most similar items from the two lists, ranked by a score (or percentage). For example,

Apple -> Apple Inc. (0.95) 
Apple -> Applebees (0.68)
Banana Boat -> Banana Bread (0.25)

So far, I have tried two approaches, both of which have failed.

Method 1: Find Jaccard Coefficients for the two lists.

import numpy as np
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(df_1, df_2)

This does not work, probably due to the varying lengths of the two data frames and I get this error:

ValueError: Found arrays with inconsistent numbers of samples

Method 2:: Using Sequence Matcher

from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

And then calling the Dataframes:

similar(df_1, df_2)

This results in an error:

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)()

KeyError: 0

How could I approach this problem?

user2762934
  • 2,332
  • 10
  • 33
  • 39
  • `SequenceMatcher` can match strings, not DataFrames. – Abbas May 06 '16 at 17:05
  • You want to compare all 1.2 Million in df1 with all 0.3 Million in df2 The question for you is, what do you want the final output to look like? Do you want to fine the most similar element of df2 for each element of df1? – piRSquared May 06 '16 at 17:07
  • This is a similar answer, but comparing with in a `DataFrame` http://stackoverflow.com/questions/36802453/comparing-two-columns-of-a-csv-and-outputting-string-similarity-ratio-in-another/36807361#36807361 – Abbas May 06 '16 at 19:22
  • It would be helpful if you could provide some sample data, and describe what your desired output should be. If you really wanted to compute a similarity score for *every* pair of rows in your two DataFrames then you would need to generate a `(300000, 1200000)` array. Assuming it contained 64 bit floats, this array would take up about *2.9TB* of RAM! I would guess that you're actually interested in finding nearest neighbours, in which case you should look into `sklearn.neighbors`, and in particular, [`BallTree`](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html). – ali_m May 07 '16 at 00:25
  • Sample Data from DF1: Apple Texas Homes LLC Microsoft Microsoft Ireland Research Google LLC Google Auto LLC John Smith Jane Doe DF2: John Smith Jack Smith Jane Smith Apple Apple Inc Applebees Inc John Johnson Johnson & Johnson My desired output is as described in the original question - a similarity score, along with the entities with the highest matches, in descending order. – user2762934 May 10 '16 at 17:16

2 Answers2

0

Solution

I had to install the distance module because it was quicker than figuring out how to use jaccard_similarity_score in this context. I could not recreate your numbers from that function.

Install distance

pip install distance

use distance

import distance

jd = lambda x, y: 1 - distance.jaccard(x, y)
df_1.head().iloc[:, 0].apply(lambda x: df_2.head().iloc[:, 0].apply(lambda y: jd(x, y)))

The head() is there for your protection. I'm pretty sure removing them would blow up your computer as it would produce a 1.2M X 0.3M matrix.

Try this. I'm not quite sure what you want in the end. We can adjust as you gain clarity.

piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Thank you, I tried your solution, and this is the output I get. [link](http://i.imgur.com/LuL84hY.png) Given that my datasets are just two one-dimensional columns, I am not sure how to interpret this result - any ideas? – user2762934 May 10 '16 at 17:42
  • Each cell is the distance between the ith element in the first data set and the jth element of the second data set. If you only want the distances calculated for those elements in the same position, i.e. distance(data set 1[0], data set 2[0]) then that even easier. I'll add an answer. – piRSquared May 10 '16 at 17:47
0

Or for comparison limited to items in the same elemental position.

import distance

jd = lambda x, y: 1 - distance.jaccard(x, y)

test_df = pd.concat([df.iloc[:, 0] for df in [df_1, df_2]], axis=1, keys=['one', 'two'])
test_df.apply(lambda x: jd(x[0], x[1]), axis=1)
piRSquared
  • 285,575
  • 57
  • 475
  • 624