Join two pandas tables on multiple strings

Question

I have two pandas dataframe : df1 and df2. df1 contains multiple emails of the customer and I want to match it with df2 to see how many customer did a test with the company by looking at if any of the emails is present in df1 is in df2.

I tried to do .str.split(";", expand=True) to split the email ids and use pd.merge to join on multiple email ids but it's too lengthy. Posting it here to find a better solution.

df1

myid      emails                                                                      price

1001     pikachu@icloud.com;charizard@gmail.com                                        1
1002     bulbasaur@gmail.com                                                           2
1003     meowth@gmail.com;james@yahoo.com;jesse@yahoo.com;wobbuffet@yahoo.com          8           
1004     abra@gmail.com;ash@yahoo.com                                                  7
1005     squirtle@gmail.com                                                            9

df2

tr_id      latest_em                                     test

101     pichu@icloud.com; paul@gmail.com                  12                            
102     ash@yahoo.com                                     13            
103     squirtle@gmail.com                                16
104     charmander@gmail.com                              18                               
105     ash@yahoo.com;misty@yahoo.com                     10

Expected Output :

myid      emails                      price   tr_id   latest_em                      test
1004     abra@gmail.com;ash@yahoo.com  7      102     ash@yahoo.com                   13
1004     abra@gmail.com;ash@yahoo.com  7      105     ash@yahoo.com;misty@yahoo.com   10
1005     squirtle@gmail.com            9      103     squirtle@gmail.com              16

Why are there multiple email addresses in a single field? Can you break each email address out to single rows? — itprorh66, Aug 19 '22 at 12:23
Your question needs a minimal reproducible example consisting of sample input, expected output, actual output, and only the relevant code necessary to reproduce the problem. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for best practices related to Pandas questions. — itprorh66, Aug 19 '22 at 12:24

score 0 · Accepted Answer · answered Aug 19 '22 at 12:31

You can split, explode, then merge:

(df1
 .assign(key=df1['emails'].str.split(';\s*')).explode('key')
 .merge(df2.assign(key=df2['latest_em'].str.split(';\s*')).explode('key'),
        on='key'
       )
 .drop(columns='key')
)

output:

   myid                        emails  price  tr_id                      latest_em  test
0  1004  abra@gmail.com;ash@yahoo.com      7    102                  ash@yahoo.com    13
1  1004  abra@gmail.com;ash@yahoo.com      7    105  ash@yahoo.com;misty@yahoo.com    10
2  1005            squirtle@gmail.com      9    103             squirtle@gmail.com    16

Join two pandas tables on multiple strings

1 Answers1