0

I have two dataframes with a common key-product names, what i want to do is create a third dataframe by joining the previous two based on partial string matches with 80-90% similarity, the datasets are quite large, i had tried using tfidf from scikit-learn, but i keep losing my reference index. In below example:Mini Wireless Bluetooth Sports Stereo Headset and OnePlus 6 Sandstone Protective Case both need to come in df3, Help will be much appreciated. Output1

Example-

import pandas as pd
df1=pd.DataFrame({'Product_Name1': ['Mini  Wireless Bluetooth Sports Stereo Headset', 'VR Box 3D Smart Glass With Remote Controller', 'OnePlus 6 Sandstone Protective Case'],'Price1': [40000, 50000, 42000]})
df2=pd.DataFrame({'Product_Name2': ['Mini  Wireless Sports Stereo Headset', 'VR Box 3D Smart Glass With Remote Controller', 'OnePlus 6 1Sandstone Protective Case'], 'Price2': [40000, 50000, 42000]})
df1set=df1.set_index('Product_Name1')
df2set=df2.set_index('Product_Name2')
df3=df1set.join(df2set,how='inner')
df3
df1
df2

First dataframe

Second dataframe

  • Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. – jezrael Dec 19 '18 at 07:42
  • Please add your expected output. – Mohit Motwani Dec 19 '18 at 08:20

1 Answers1

0

What you need is fuzzy matching. Fuzzy matching is used to compare strings which are very similar to each other. You can use fuzzy wuzzy for this.

Example of fuzzy matching

from fuzzywuzzy import process
process.extractOne('Mini Wireless Bluetooth Sports Stereo Headset', df2.Product_Name2)

('Mini  Wireless Sports Stereo Headset', 95, 0)

This value has a 95% match.

I have changed the order of df2 for demonstration.

df1=pd.DataFrame({'Product_Name1': ['Mini  Wireless Bluetooth Sports Stereo Headset', 
                                    'VR Box 3D Smart Glass With Remote Controller',
                                    'OnePlus 6 Sandstone Protective Case'],
                  'Price1': [40000, 50000, 42000]})

df1

    Product_Name1                                   Price1
0   Mini Wireless Bluetooth Sports Stereo Headset   40000
1   VR Box 3D Smart Glass With Remote Controller    50000
2   OnePlus 6 Sandstone Protective Case             42000


df2=pd.DataFrame({'Product_Name2': ['Mini  Wireless Sports Stereo Headset',
                                    'OnePlus 6 1Sandstone Protective Case',
                                    'VR Box 3D Smart Glass With Remote Controller'],
                  'Price2': [40000, 42000, 50000]})

df2

     Product_Name2                                  Price2
0   Mini Wireless Sports Stereo Headset             40000
1   OnePlus 6 1Sandstone Protective Case            42000
2   VR Box 3D Smart Glass With Remote Controller    50000

Now we write a function which matches each value of df1 Product_Name1 with every value of df2 Product_Name2 and return the index of df2 where it matched the highest.

def fuzzy(x):
    closest_match = process.extractOne(x, df2.Product_Name2.values)[0]
    index = pd.Index(df2.Product_Name2).get_loc(closest_match)
    return index

The we use apply to get the result

df1['match'] = df1['Product_Name1'].apply(fuzzy)
df1

Product_Name1                                      Price1   match
0   Mini Wireless Bluetooth Sports Stereo Headset   40000   0
1   VR Box 3D Smart Glass With Remote Controller    50000   2
2   OnePlus 6 Sandstone Protective Case            42000    1

As I don't have your expected output, I'm gonna merge them.

pd.merge(df1, df2, left_on='match', right_on=df2.index)

   Product_Name1                                  Price1    match   Product_Name2   Price 2 

0   Mini Wireless Bluetooth Sports Stereo Headset   40000   0        Mini Wireless Sports Stereo Headset            40000
1   VR Box 3D Smart Glass With Remote Controller    50000   2        VR Box 3D Smart Glass With Remote Controller    50000
2   OnePlus 6 Sandstone Protective Case             42000   1        OnePlus 6 1Sandstone Protective Case        42000

Let me know if it works for you

Mohit Motwani
  • 4,662
  • 3
  • 17
  • 45
  • Thanks for the code, however the results don't match . For example the match suggests- Huawei Nova 2i - Smartphone - 5.50 - 4GB RAM - 64GB ROM - 16MP Camera - Black as a match to Apple iPhone 7 Plus - 128 GB,where as there was a different product for Huawei in that range – Rukesh Dutta Dec 19 '18 at 12:16