
I have two DataFrames, each with a string column, as below:

Df1 (the original df has 2000 rows of Name):

Id    Name
1     Paper
2     Paper
3     Scissors
4     Mat
5     Cat
6     Cat

2nd Df (the original df has 1000+ rows of Item_Name):

Item_ID   Item_Name
1         Paper Bag
2         wallpaper
3         paper
4         cat cage

I need to find the strings in column Name that appear in column Item_Name.

1st Approach: By using str.contains:

I know how to match strings if there were only one column and a couple of strings to be matched, like below:

df[df['Name'].str.contains("paper|cat", na=False)]

But how do I do this when the strings to be matched sit in two columns of different DataFrames (Name and Item_Name)?
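One way to bridge the two DataFrames (a sketch using the sample data above, not part of the original question) is to build the alternation pattern for `str.contains` dynamically from the other DataFrame's column:

```python
import re
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3, 4, 5, 6],
                    'Name': ['Paper', 'Paper', 'Scissors', 'Mat', 'Cat', 'Cat']})
df2 = pd.DataFrame({'Item_ID': [1, 2, 3, 4],
                    'Item_Name': ['Paper Bag', 'wallpaper', 'paper', 'cat cage']})

# build one alternation pattern "Paper|Scissors|Mat|Cat" from df1,
# escaping each name in case it contains regex metacharacters
pattern = '|'.join(re.escape(s) for s in df1['Name'].unique())

# flag the rows of df2 whose Item_Name contains any Name from df1
mask = df2['Item_Name'].str.contains(pattern, case=False, na=False)
print(df2[mask])
```

This only gives a boolean filter, not the Id-to-Item_ID mapping asked for below; the answers cover that part.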

2nd Approach: Fuzzywuzzy

from fuzzywuzzy import fuzz

matched = []
for row in df1.index:
    name = df1.at[row, "Name"]          # .get_value is deprecated; use .at
    for col in df2.index:
        item_name = df2.at[col, "Item_Name"]
        matched_token = fuzz.token_sort_ratio(name, item_name)
        if matched_token > 80:
            matched.append([name, item_name, matched_token])

The problem is that this will be slow, and my intended output is a little different from what I get from fuzzywuzzy. The output should look like:

Id Name     Item_ID
1  Paper     1,2,3
2  Paper     1,2,3
3  Scissors  NA 
4  Mat       NA 
5  Cat       4
6  Cat       4 

To summarize:

  1. How to do str.contains when there are two DataFrames with different column names
  2. How to transform the df to get the intended output above

3 Answers


You can use pd.Series.apply with a custom function:

def matcher(x):
    res = df2.loc[df2['Item_Name'].str.contains(x, regex=False, case=False), 'Item_ID']
    return ','.join(res.astype(str))

df1['Item_ID'] = df1['Name'].apply(matcher)

print(df1)

   Id      Name Item_ID
0   1     Paper   1,2,3
1   2     Paper   1,2,3
2   3  Scissors        
3   4       Mat        
4   5       Cat       4
5   6       Cat       4

There are ways you can make this more efficient:

  • Only work on unique items in df1['Name']: apply is a row-wise loop.
  • Use list comprehensions instead of pd.Series.apply. Both are Python-level loops, but list comprehensions often outperform the Pandas str methods.
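The first bullet can be sketched like this (my own illustration, reusing the `matcher` function from the answer): compute the match once per unique name, then broadcast the result back with `map`:

```python
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3, 4, 5, 6],
                    'Name': ['Paper', 'Paper', 'Scissors', 'Mat', 'Cat', 'Cat']})
df2 = pd.DataFrame({'Item_ID': [1, 2, 3, 4],
                    'Item_Name': ['Paper Bag', 'wallpaper', 'paper', 'cat cage']})

def matcher(x):
    res = df2.loc[df2['Item_Name'].str.contains(x, regex=False, case=False), 'Item_ID']
    return ','.join(res.astype(str))

# run matcher once per unique name instead of once per row
unique_matches = {name: matcher(name) for name in df1['Name'].unique()}
df1['Item_ID'] = df1['Name'].map(unique_matches)
```

With 2000 rows but only a handful of distinct names, this cuts the number of `str.contains` scans from 2000 to the number of unique names.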

But the above don't improve algorithmic complexity. For orders of magnitude better improvement, you should consider a trie-based algorithm, such as this answer utilizing the Aho–Corasick algorithm.
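For illustration only, here is a minimal pure-Python sketch of the Aho–Corasick idea: build one automaton over all the names, then find every name in an item with a single pass over its characters (the linked answer uses an optimized library instead):

```python
from collections import deque

def build_automaton(patterns):
    """Build the Aho-Corasick trie with failure links and output sets."""
    goto, fail, out = [{}], [0], [set()]
    for pid, pat in enumerate(patterns):
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(pid)
    # BFS from the root's children to compute failure links
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out

def find_patterns(text, goto, fail, out):
    """Return the set of pattern indices occurring anywhere in text."""
    node, found = 0, set()
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        found |= out[node]
    return found

names = ['paper', 'scissors', 'mat', 'cat']
goto, fail, out = build_automaton(names)
hits = find_patterns('wallpaper', goto, fail, out)   # indices into names
```

Each item string is scanned once regardless of how many names there are, which is where the order-of-magnitude improvement over the nested `str.contains` loop comes from.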

  • your item_ID column is String – min2bro Nov 27 '18 at 11:16
  • @min2bro, Yes, so is `Item_ID` in OP's desired output. – jpp Nov 27 '18 at 11:16
  • @jpp: I will try this and get back to you – Rahul Agarwal Nov 27 '18 at 11:57
  • When I tried for a similar cell, it is matching with different Item_ID. Since you mentioned that it works only on unique df1['Name']. Should I do `pd.Series(df1['Name'].unique()).apply(matcher)` ? – Rahul Agarwal Nov 27 '18 at 14:18
  • 1
    `Since you mentioned that it works only on unique`. No, that's not what my post says. It says *if you want to improve performance*, remove duplicate values first. I can't replicate your problem with `matching with different Item_ID`. – jpp Nov 27 '18 at 14:19
  • One more issue, I guess this is with how `str.contains` works: it has not matched anything where there are two words. For example, when `Name` is "Paper Bag" and `Item_Name` is "Paper". I am looking for those matches also. – Rahul Agarwal Nov 27 '18 at 15:30
  • Since `Paper` is in `Paper Bag`, it does match. In my output, Paper does map to `1`, and `1` refers to `Paper Bag`. – jpp Nov 27 '18 at 15:38
  • What I am trying to say is, it is the opposite of what is presented in the sample df above – Rahul Agarwal Nov 27 '18 at 15:39
  • Ah, I see what you're saying now. You're right, `str.contains` works one way only. If you need both ways, that's another question. – jpp Nov 27 '18 at 15:40
  • Ahh!! Is there any way you can tweak your solution, or should I ask a new question? – Rahul Agarwal Nov 27 '18 at 15:43
  • 1
    @RahulAgarwal, Unfortunately, I think it's beyond a tweak, best to ask a new question (with a complete example so people understand what you mean!). – jpp Nov 27 '18 at 15:49
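Following up on the one-way limitation discussed in the comments, a small sketch (my own, not from the thread) of checking substring containment in both directions:

```python
import pandas as pd

# hypothetical sample where the longer string sits on either side
df1 = pd.DataFrame({'Id': [1, 2], 'Name': ['Paper Bag', 'Cat']})
df2 = pd.DataFrame({'Item_ID': [1, 2], 'Item_Name': ['paper', 'cat cage']})

def two_way_match(name, item_name):
    # substring containment in either direction, case-insensitive
    a, b = name.lower(), item_name.lower()
    return a in b or b in a

df1['Item_ID'] = df1['Name'].apply(
    lambda name: ','.join(str(i)
                          for i, item in zip(df2['Item_ID'], df2['Item_Name'])
                          if two_way_match(name, item)))
```

Here "Paper Bag" matches item "paper" (the reverse direction `str.contains` misses) and "Cat" matches "cat cage" as before.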

You can use df.apply here

def get_vals(row):
    res = df2.loc[df2['Item_Name'].str.contains(row['Name'], case=False, regex=False), 'Item_ID']
    return ','.join(map(str, res.tolist()))

df1['Item_ID'] = df1.apply(get_vals, axis=1)

Output:

Id     Name  Item_ID
1     Paper   1,2,3
2     Paper   1,2,3
3  Scissors        
4       Mat        
5       Cat       4
6       Cat       4

I believe this will give you your desired result.

import numpy as np
import pandas as pd

# note: here `df` is the question's df1 and `df1` is the question's df2
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                   'Name': ['paper', 'paper', 'scissors', 'mat', 'cat', 'cat']})
df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Name': ['paper bag', 'wallpaper', 'paper', 'cat cage']})

def lookup_prod(ip):
    lst = []
    for idx, row in df1.iterrows():
        if ip in row['Name']:
            lst.append(row['ID'])
    if not lst:
        return np.NaN
    return lst

df['Item_ID'] = df['Name'].apply(lookup_prod)

Output:

 ID  Name         Item_ID
0   1   paper     [1, 2, 3]
1   2   paper     [1, 2, 3]
2   3   scissors    NaN
3   4   mat         NaN
4   5   cat         [4]
5   6   cat         [4]
  • It is giving me all rows as NaN. In your example you have used the same column names in both df. Can you change them, so I can understand which ID and which Name you are referring to in your code? – Rahul Agarwal Nov 27 '18 at 16:15