Merging dataframes based on matching substrings between two disparate columns

Question

My issue here is joining two disparate name fields (without so many exceptions) using python/pandas.

I have two dataframes where I would like to merge on 'short_name' or 'long_name' of df 1 and 'name' of df2.

df 1:

           short_name                            long_name  age  height_cm  \
0            L. Messi       Lionel Andrés Messi Cuccittini   32        170   
1   Cristiano Ronaldo  Cristiano Ronaldo dos Santos Aveiro   34        187   
2           Neymar Jr        Neymar da Silva Santos Junior   27        175   
3            J. Oblak                            Jan Oblak   26        188   
4           E. Hazard                          Eden Hazard   28        175   
5        K. De Bruyne                      Kevin De Bruyne   28        181

df 2:

                      name     2014      2015      2016      2017      2018  \
0             Kylian Mbappé      NaN    0.0570    1.9238   51.3000  175.5600   
1                    Neymar   74.100   98.8000  114.0000  133.0000  205.2000   
2             Mohamed Salah   14.820   17.1000   26.6000   39.9000  144.4000   
3                Harry Kane    3.420   22.8000   41.8000   72.2000  159.6000   
4               Eden Hazard   53.010   74.1000   76.0000   82.6500  143.6400   
5              Lionel Messi  136.800  136.8000  136.8000  136.8000  197.6000

I modified df2's 'name' column to follow the (first initial, last name convention) of df1's 'short_name' column. Unfortunately it led to many exceptions since df2's 'name' column doesn't always follow that convention (examples include, 'Neymar Jr' (expected: "Neymar"), Cristiano Ronaldo (expected: C. Ronaldo), and Roberto Firmino (expected: R. Firmino).

The only other thing I can think of is using substring matching. Is there a way to split df2's 'name' column into separate substrings and then see if df1's 'long_name' contains all of those elements (ie seeing if "Lionel Andrés Messi Cuccittini" has both "Lionel" and "Messi" from d1's name and then merging on it)?

After searching for a while, it doesn't seem like something in pandas functionality since it splits it into several columns. I also don't know if merges can take conditions like substring matches. Everything I've thought of doesn't address these exceptions/non-matches except for substring matching.

Edit: As another user said, it might be worth splitting the names into sets and making sure all the substrings in "name" in df2 is found in "long_name" in df1. But I cannot seem to find a way to do that without splitting the string into separate columns.

SOLVED: What seems to work is isolating and creating a copy of the columns (making them series) then splitting the names into lists and seeing if the shorter names were subsets of the longer names in a double for-loop. Here is my code:

names = df1['name']
long_names = df2['long_name']

for i in range(len(names)):
    name_list = names[i].split()
    for j in range(len(long_names)):
        long_name_list = long_names[j].split()
        if set(name_list).issubset(long_name_list):
            df2.loc[j, "long_name"] = df1.loc[i, "name"]

Does this answer your question? [How to join (merge) data frames (inner, outer, left, right)](https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right) — mindless-overflow, Jan 22 '20 at 02:59
Unfortuantely, I don't think so. The issue is merging by matching substrings, which I don't think is addressed in the answers in that question. — exlo, Jan 22 '20 at 03:30
You could try to split df2["name"] and df1["long_name"] by a blank space, and then try to get a match using sets. This is what i mean: ``` x = set("Lionel Messi".split()) y = set("Lionel Andrés Messi Cuccittini".split()) ``` then, you can check if set(x).issubset(y) — Jorge Avila, Jan 22 '20 at 03:39
@JorgeAvila I'm trying to use this on the dfs and I'm running I'm unable to make the splits into sets, they're resulting in new columns. Could you provide an example of code using sets with dataframes/series? — exlo, Jan 22 '20 at 04:45

Merging dataframes based on matching substrings between two disparate columns

0 Answers0