0

I have the following two dataframes -

DataFrame - A

enter image description here

DataFrame - B

enter image description here

Now, I want to perform a join on both DataFrames on the columns 'Code - A' and 'Code - B' such that the join is performed if the value in 'Code - A' is a substring of or contains the value in 'Code - B'. A sample output for this example would be as follows -

enter image description here

The values in 'Code - A' and 'Code - B' are of type int, which I am converting to a string for easier manipulation. And, so far, I have implemented the following code:

data_A = [['A', 100010], ['B', 200010]] 
A = pd.DataFrame(data_A, columns = ['Type - A', 'Code - A'])
data_B = [['AA', 1000], ['BB', 10001020],['CC',10001030],['DD',10001040],['EE',20001020],['FF',20001030],['GG',200010]] 
B = pd.DataFrame(data_B, columns = ['Type - B', 'Code - B'])
A['Code - A'] = A['Code - A'].astype(str)
B['Code - B'] = B['Code - B'].astype(str)  
A['join'] = 1 #creating a dummy column in Dataframe A called join
B['join'] = 1 #creating a dummy column in Dataframe B called join
cartesian = A.merge(B, on='join').drop('join', axis=1) #getting the cartesian product
cartesian['Match'] = cartesian.apply(lambda x: x['Code - A'].find(x['Code - B']), axis=1).ge(0) #checking for columns where 'Code - A' has a string which is either a substring of 'Code - B' or one where 'Code - B' is a substring of 'Code - A'
mask = cartesian['Match'] == True #filtering to see the matched items
full = cartesian.loc[mask]

However, this only returns the following Dataframe.

enter image description here

How should I modify my code to get the aforementioned output?

Thanks!

lukin
  • 55
  • 2
  • 7
  • 1
    Please don't post images of code, data, or Tracebacks. Copy and paste it as text then format it as code (select it and type `ctrl-k`) ... [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – wwii Oct 24 '20 at 23:14
  • 1
    Hey, apologies for not following this rule before! I have now added the code to create the sample dataframes mentioned in the images. – lukin Oct 24 '20 at 23:25

2 Answers2

2

After you create the cartesian dataframe, you can query it like so:

cartesian[cartesian.apply(lambda row: row['Code - B'] in row['Code - A'] or row['Code - A'] \
  in row['Code - B'], axis = 1)]

Output:

   Type - A Code - A Type - B  Code - B
0         A   100010       AA      1000
1         A   100010       BB  10001020
2         A   100010       CC  10001030
3         A   100010       DD  10001040
11        B   200010       EE  20001020
12        B   200010       FF  20001030
13        B   200010       GG    200010
jschnurr
  • 1,181
  • 6
  • 8
1

Looking into the codes, it seems that only the first 4 digits matter, so I created a new column on each dataframe with the first 4 digits called 'key',then merged the two dataframes:

#Creating a key for each dataframe:
for i in A.index:
    A.loc[i,'key'] = str(A.loc[i,'Code - A'])[:4]

for i in B.index:
    B.loc[i,'key'] = str(B.loc[i,'Code - B'])[:4]

#merging the dataframes
A.merge(B)

Output:

    Type - A    Code - A    key      Type - B   Code - B
0   A           100010      1000     AA         1000
1   A           100010      1000     BB         10001020
2   A           100010      1000     CC         10001030
3   A           100010      1000     DD         10001040
4   B           200010      2000     EE         20001020
5   B           200010      2000     FF         20001030
6   B           200010      2000     GG         200010
jlb_gouveia
  • 603
  • 3
  • 11