How do I join two dataframes based on a dataframe's row value being a substring/contained in another dataframe's row value?

Question

I have the following two dataframes -

DataFrame - A

DataFrame - B

Now, I want to perform a join on both DataFrames on the columns 'Code - A' and 'Code - B' such that the join is performed if the value in 'Code - A' is a substring of or contains the value in 'Code - B'. A sample output for this example would be as follows -

The values in 'Code - A' and 'Code - B' are of type int, which I am converting to a string for easier manipulation. And, so far, I have implemented the following code:

data_A = [['A', 100010], ['B', 200010]] 
A = pd.DataFrame(data_A, columns = ['Type - A', 'Code - A'])
data_B = [['AA', 1000], ['BB', 10001020],['CC',10001030],['DD',10001040],['EE',20001020],['FF',20001030],['GG',200010]] 
B = pd.DataFrame(data_B, columns = ['Type - B', 'Code - B'])
A['Code - A'] = A['Code - A'].astype(str)
B['Code - B'] = B['Code - B'].astype(str)  
A['join'] = 1 #creating a dummy column in Dataframe A called join
B['join'] = 1 #creating a dummy column in Dataframe B called join
cartesian = A.merge(B, on='join').drop('join', axis=1) #getting the cartesian product
cartesian['Match'] = cartesian.apply(lambda x: x['Code - A'].find(x['Code - B']), axis=1).ge(0) #checking for columns where 'Code - A' has a string which is either a substring of 'Code - B' or one where 'Code - B' is a substring of 'Code - A'
mask = cartesian['Match'] == True #filtering to see the matched items
full = cartesian.loc[mask]

However, this only returns the following Dataframe.

How should I modify my code to get the aforementioned output?

Thanks!

Please don't post images of code, data, or Tracebacks. Copy and paste it as text then format it as code (select it and type `ctrl-k`) ... [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — wwii, Oct 24 '20 at 23:14
Hey, apologies for not following this rule before! I have now added the code to create the sample dataframes mentioned in the images. — lukin, Oct 24 '20 at 23:25

score 2 · Accepted Answer · answered Oct 25 '20 at 00:27

After you create the cartesian dataframe, you can query it like so:

cartesian[cartesian.apply(lambda row: row['Code - B'] in row['Code - A'] or row['Code - A'] \
  in row['Code - B'], axis = 1)]

Output:

   Type - A Code - A Type - B  Code - B
0         A   100010       AA      1000
1         A   100010       BB  10001020
2         A   100010       CC  10001030
3         A   100010       DD  10001040
11        B   200010       EE  20001020
12        B   200010       FF  20001030
13        B   200010       GG    200010

score 1 · Answer 2 · answered Oct 24 '20 at 23:38

Looking into the codes, it seems that only the first 4 digits matter, so I created a new column on each dataframe with the first 4 digits called 'key',then merged the two dataframes:

#Creating a key for each dataframe:
for i in A.index:
    A.loc[i,'key'] = str(A.loc[i,'Code - A'])[:4]

for i in B.index:
    B.loc[i,'key'] = str(B.loc[i,'Code - B'])[:4]

#merging the dataframes
A.merge(B)

Output:

    Type - A    Code - A    key      Type - B   Code - B
0   A           100010      1000     AA         1000
1   A           100010      1000     BB         10001020
2   A           100010      1000     CC         10001030
3   A           100010      1000     DD         10001040
4   B           200010      2000     EE         20001020
5   B           200010      2000     FF         20001030
6   B           200010      2000     GG         200010

How do I join two dataframes based on a dataframe's row value being a substring/contained in another dataframe's row value?

2 Answers2