I have two df with two string columns as below:
Df1: Original df has 2000 rows of Names
Id Name
1 Paper
2 Paper
3 Scissors
4 Mat
5 Cat
6 Cat
2nd Df: Original df has 1000+ Item_Name
Item_ID Item_Name
1 Paper Bag
2 wallpaper
3 paper
4 cat cage
I need strings in column Name
that are in column Item_Name
1st Approach: By using str.contains
:
I know how to match string if their was one column and a couple of strings to be matched like below:
df[df['Name'].str.contains("paper|cat", na=False)]
But how to do when there are two columns of string (name & Item_name) to be matched?
2nd Approach: Fuzzywuzzy
matched = []
for row in df1.index:
name = df1.get_value(row,"Name")
for columns in df2.index:
item_name=df2.get_value(columns,"Item_Name")
matched_token=fuzz.token_sort_ratio(name,item_name)
if matched_token> 80:
matched.append([name,item_name,matched_token])
The problem will be, it will be slow and my intended output what I want is litle more which I get from fuzzywuzzy. The output looks like:
Id Name Item_ID
1 Paper 1,2,3
2 Paper 1,2,3
3 Scissors NA
4 Mat NA
5 Cat 4
6 Cat 4
Summarize:
- How to do str.contains if their are two df with different column name
- How to transform the df to get the above intended output