My issue here is joining two disparate name fields (without so many exceptions) using python/pandas.
I have two dataframes where I would like to merge on 'short_name' or 'long_name' of df 1 and 'name' of df2.
df 1:
short_name long_name age height_cm \
0 L. Messi Lionel Andrés Messi Cuccittini 32 170
1 Cristiano Ronaldo Cristiano Ronaldo dos Santos Aveiro 34 187
2 Neymar Jr Neymar da Silva Santos Junior 27 175
3 J. Oblak Jan Oblak 26 188
4 E. Hazard Eden Hazard 28 175
5 K. De Bruyne Kevin De Bruyne 28 181
df 2:
name 2014 2015 2016 2017 2018 \
0 Kylian Mbappé NaN 0.0570 1.9238 51.3000 175.5600
1 Neymar 74.100 98.8000 114.0000 133.0000 205.2000
2 Mohamed Salah 14.820 17.1000 26.6000 39.9000 144.4000
3 Harry Kane 3.420 22.8000 41.8000 72.2000 159.6000
4 Eden Hazard 53.010 74.1000 76.0000 82.6500 143.6400
5 Lionel Messi 136.800 136.8000 136.8000 136.8000 197.6000
I modified df2's 'name' column to follow the (first initial, last name convention) of df1's 'short_name' column. Unfortunately it led to many exceptions since df2's 'name' column doesn't always follow that convention (examples include, 'Neymar Jr' (expected: "Neymar"), Cristiano Ronaldo (expected: C. Ronaldo), and Roberto Firmino (expected: R. Firmino).
The only other thing I can think of is using substring matching. Is there a way to split df2's 'name' column into separate substrings and then see if df1's 'long_name' contains all of those elements (ie seeing if "Lionel Andrés Messi Cuccittini" has both "Lionel" and "Messi" from d1's name and then merging on it)?
After searching for a while, it doesn't seem like something in pandas functionality since it splits it into several columns. I also don't know if merges can take conditions like substring matches. Everything I've thought of doesn't address these exceptions/non-matches except for substring matching.
Edit: As another user said, it might be worth splitting the names into sets and making sure all the substrings in "name" in df2 is found in "long_name" in df1. But I cannot seem to find a way to do that without splitting the string into separate columns.
SOLVED: What seems to work is isolating and creating a copy of the columns (making them series) then splitting the names into lists and seeing if the shorter names were subsets of the longer names in a double for-loop. Here is my code:
names = df1['name']
long_names = df2['long_name']
for i in range(len(names)):
name_list = names[i].split()
for j in range(len(long_names)):
long_name_list = long_names[j].split()
if set(name_list).issubset(long_name_list):
df2.loc[j, "long_name"] = df1.loc[i, "name"]