Python/Pandas matching substring in another substrings

Question

I am stuck in finding a common key stored in 2 different substings of 2 different dataframes and then output a 3rd column:

import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Name':['John','Michael','Dan','George', 'Adam'], 'Code1':['AAA OO','BBB UU','JJ',np.nan,'II']})

df2 = pd.DataFrame({'Second Name':['Smith','Cohen','Moore','Kas', 'Faber'], 'code2':['UU HHH','AAA GGG',np.nan , 'TT II', np.nan]})

The expected output:

I have done my research....... and the question is really similar to this one: How to merge pandas on string contains?. However here the key has a single item, my example has 2 items in both keys.

Does this answer your question? [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) — sushanth, Nov 11 '20 at 10:48
Hello no thanks, I am asking how to match a common key stored in 2 substrings, I know how to do a standard match. — Caiotru, Nov 11 '20 at 10:56
@MayankPorwal hello, they share the same AAA code. I have found a way to resolve the query but only if one key does not have another code in the same line. — Caiotru, Nov 11 '20 at 11:02

score 0 · Accepted Answer · answered Nov 18 '20 at 14:58

Assuming that your codes are always split by a whitespace.

You can use list comprehensions to check the presence of each code in the Code1 column in the Code2 column. By retrieving the indices of matching codes, we can get a Dataframe containing rows with overlapping codes.

Then we can update the original dataframe to get the intended output.

# Create a list of matching codes
list_of_matches = df1['Code1'].apply(lambda x: [
                         any([word in str(list_of_words).split() 
                              for word in str(x).split()]) 
                              for list_of_words in df2['code2']])

# Get the indices of matching codes
i, j = np.where(list_of_matches.values.tolist())

# Create a new dataframe with name and second name of rows with matching code
# And drop rows with NA, as they don't make sense
df3 = pd.DataFrame(np.column_stack([df1.loc[i], df2.loc[j]]), 
                   columns=df1.columns.append(df2.columns)).dropna()

# Create columns in your original dataframe to be able to update the dataframe
df1['Second Name'] = np.nan
df1['code2'] = np.nan

# Update dataframe with matching rows
df1.update(df3)

Output

    Name    Code1   Second Name   code2
0   John    AAA OO  Cohen         AAA GGG
1   Michael BBB UU  Smith         UU HHH
2   Dan     JJ      NaN           NaN
3   George  NaN     NaN           NaN
4   Adam    II      Kas           TT II

Python/Pandas matching substring in another substrings

1 Answers1