-2

I am stuck in finding a common key stored in 2 different substings of 2 different dataframes and then output a 3rd column:

import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Name':['John','Michael','Dan','George', 'Adam'], 'Code1':['AAA OO','BBB UU','JJ',np.nan,'II']})

df2 = pd.DataFrame({'Second Name':['Smith','Cohen','Moore','Kas', 'Faber'], 'code2':['UU HHH','AAA GGG',np.nan , 'TT II', np.nan]})

The expected output:

enter image description here

I have done my research....... and the question is really similar to this one: How to merge pandas on string contains?. However here the key has a single item, my example has 2 items in both keys.

Caiotru
  • 335
  • 1
  • 10
  • Does this answer your question? [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) – sushanth Nov 11 '20 at 10:48
  • Hello no thanks, I am asking how to match a common key stored in 2 substrings, I know how to do a standard match. – Caiotru Nov 11 '20 at 10:56
  • @MayankPorwal hello, they share the same AAA code. I have found a way to resolve the query but only if one key does not have another code in the same line. – Caiotru Nov 11 '20 at 11:02

1 Answers1

0

Assuming that your codes are always split by a whitespace.

You can use list comprehensions to check the presence of each code in the Code1 column in the Code2 column. By retrieving the indices of matching codes, we can get a Dataframe containing rows with overlapping codes.

Then we can update the original dataframe to get the intended output.

# Create a list of matching codes
list_of_matches = df1['Code1'].apply(lambda x: [
                         any([word in str(list_of_words).split() 
                              for word in str(x).split()]) 
                              for list_of_words in df2['code2']])

# Get the indices of matching codes
i, j = np.where(list_of_matches.values.tolist())

# Create a new dataframe with name and second name of rows with matching code
# And drop rows with NA, as they don't make sense
df3 = pd.DataFrame(np.column_stack([df1.loc[i], df2.loc[j]]), 
                   columns=df1.columns.append(df2.columns)).dropna()

# Create columns in your original dataframe to be able to update the dataframe
df1['Second Name'] = np.nan
df1['code2'] = np.nan

# Update dataframe with matching rows
df1.update(df3)

Output

    Name    Code1   Second Name   code2
0   John    AAA OO  Cohen         AAA GGG
1   Michael BBB UU  Smith         UU HHH
2   Dan     JJ      NaN           NaN
3   George  NaN     NaN           NaN
4   Adam    II      Kas           TT II
Rik Kraan
  • 586
  • 2
  • 16