
I know there are a number of very similar questions, but I can't make their solutions work for me. I want to add a column from another dataframe based on a partial string match: one string is contained in the other, but not necessarily at the beginning. Here is an example:

import pandas as pd

df = pd.DataFrame({'citizenship': ['Algeria', 'Andorra', 'Bahrain', 'Spain']})
df2 = pd.DataFrame({'Country_Name': ['Algeria, Republic of', 'Andorra', 'Kingdom of Bahrain', 'Russia'],
                    'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe']})

Each 'citizenship' in df should get the matching continent from df2 attached, based on the partial string match. I have been trying to apply the solution from "Pandas: join on partial string match, like Excel VLOOKUP", but cannot get it to work:

def get_continent(x):

     return df2.loc[df2['Country_Name'].str.contains(x), df2['Continent_Name']].iloc[0]

df['Continent_Name'] = df['citizenship'].apply(get_continent)

But it raises a KeyError:

KeyError: "None of [Index(['Asia', 'Europe', 'Antarctica', 'Africa', 'Oceania', 'Europe', 'Africa',\n       'North America', 'Europe', 'Asia',\n       ...\n       'Asia', 'South America', 'Oceania', 'Oceania', 'Asia', 'Africa',\n       'Oceania', 'Asia', 'Asia', 'Asia'],\n      dtype='object', length=262)] are in the [columns]"

Does anybody know what is going on here?

Papayapap

2 Answers


One way to do this is to create a citizenship column in df2 and use it to join the two dataframes. I think the easiest way to build that column is with a regex that extracts the citizenship from each Country_Name.

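import re

# Build one alternation pattern from all citizenships; re.escape keeps
# names containing regex metacharacters (e.g. parentheses) literal
citizenship_list = df['citizenship'].unique().tolist()
citizenship_regex = r"(" + r"|".join(re.escape(c) for c in citizenship_list) + r")"
# Extract the matching citizenship (NaN if none) from each country name
df2["citizenship"] = df2["Country_Name"].str.extract(citizenship_regex).iloc[:, 0]
joined_df = df.merge(df2, on=["citizenship"], how="left")
print(joined_df)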

Then you can reduce this to select just the columns you want.

Also, you probably want to normalize both the citizenship and Country_Name columns first, e.g. by running df['citizenship'] = df['citizenship'].str.lower() (and the equivalent on df2['Country_Name']), so that you don't miss a match due to case.
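As a minimal sketch of that case normalization, assuming it is acceptable to match on a lower-cased helper column (the name `key` is hypothetical, and the mixed-case citizenship values are made up for illustration), the original columns can be left untouched:

```python
import re
import pandas as pd

df = pd.DataFrame({'citizenship': ['algeria', 'ANDORRA', 'Bahrain', 'Spain']})
df2 = pd.DataFrame({
    'Country_Name': ['Algeria, Republic of', 'Andorra', 'Kingdom of Bahrain', 'Russia'],
    'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe'],
})

# Hypothetical helper column "key": lower-cased citizenship used only for joining
left = df.assign(key=df['citizenship'].str.lower())

# Alternation of all lower-cased citizenships, escaped so they match literally
pattern = "(" + "|".join(re.escape(k) for k in left['key'].unique()) + ")"

# Extract the first citizenship found in each lower-cased country name (NaN if none)
right = df2.assign(key=df2['Country_Name'].str.lower().str.extract(pattern, expand=False))

# Left join keeps every row of df; unmatched citizenships get NaN
joined = left.merge(right[['key', 'Continent_Name']], on='key', how='left').drop(columns='key')
print(joined)
```

Dropping the helper column at the end also reduces the result to just the columns of interest.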

benji

I can see two issues with the code in your question:

  1. In the function's return line, remove the df2[] wrapper from the second argument to df2.loc, so that the column is given by its name as a string: df2.loc[df2['Country_Name'].str.contains(x), 'Continent_Name'].iloc[0]
  2. The code from the linked answer only works when every "citizenship" in df has a match in "Country_Name" in df2. When nothing matches, the filtered Series is empty and .iloc[0] raises an IndexError.

So this works for example:

df = pd.DataFrame({'citizenship': ['Algeria', 'Andorra', 'Bahrain', 'Spain']})    
df2 = pd.DataFrame({'Country_Name': ['Algeria', 'Andorra', 'Bahrain', 'Spain'], 
'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe']})


def get_continent(x):
    return df2.loc[df2['Country_Name'].str.contains(x), 'Continent_Name'].iloc[0]

df['Continent_Name'] = df['citizenship'].apply(get_continent)

#   citizenship Continent_Name
# 0    Algeria  Africa
# 1    Andorra  Europe
# 2    Bahrain  Asia
# 3    Spain    Europe

If you want to get the original code to work, you could put in a try/except:

df = pd.DataFrame({'citizenship': ['Algeria', 'Andorra', 'Bahrain', 'Spain']}) 
df2 = pd.DataFrame({'Country_Name': ['Algeria, Republic of', 'Andorra', 'Kingdom of Bahrain', 'Russia'], 
'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe']})

def get_continent(x):
    try:
        return df2.loc[df2['Country_Name'].str.contains(x), 'Continent_Name'].iloc[0]
    except IndexError:
        return None

df['Continent_Name'] = df['citizenship'].apply(get_continent)


#   citizenship Continent_Name
# 0  Algeria      Africa
# 1  Andorra      Europe
# 2  Bahrain      Asia
# 3  Spain        None
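A possible variation on the same idea, offered only as a sketch: check whether the filtered Series is empty instead of catching the IndexError, and pass regex=False so that `str.contains` treats the citizenship as a literal substring rather than a regular expression. The variable name `matches` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'citizenship': ['Algeria', 'Andorra', 'Bahrain', 'Spain']})
df2 = pd.DataFrame({
    'Country_Name': ['Algeria, Republic of', 'Andorra', 'Kingdom of Bahrain', 'Russia'],
    'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe'],
})

def get_continent(name):
    # regex=False: match `name` as a plain substring, not as a pattern
    mask = df2['Country_Name'].str.contains(name, regex=False)
    matches = df2.loc[mask, 'Continent_Name']
    # An empty Series means no country name contained `name`
    return matches.iloc[0] if not matches.empty else None

df['Continent_Name'] = df['citizenship'].apply(get_continent)
print(df)
```

If several country names contain the same citizenship, this still returns only the first match, just like the try/except version.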
user6386471
  • @Felix, I've just updated the answer in case you wanted to get the original code to work :D – user6386471 Nov 20 '20 at 16:51
  • Thanks! So the IndexError occurs when df2.loc does not find anything, if I am not mistaken. I am still struggling to understand that line in the function, however :( May I ask you for help here as well? So the part of loc before the comma gives, for every row of df2['Country_Name'], a boolean that is True if x is contained in that string, and loc then returns 'Continent_Name' for the rows where it is True? And what is the .iloc[0] doing? Sorry to spam you.. – Papayapap Nov 20 '20 at 17:04
  • No worries at all - yep, so if you were to drop the `iloc[0]` from the end, you would get a pandas Series object, and because all of the rows will be the same (i.e. they will all be the same Continent_Name as per the filter), you can just take the first row (`.iloc[0]`) to get the value. – user6386471 Nov 20 '20 at 17:10
  • I guess it's also worth mentioning that in our case here only one value in df2 matches the input string, so the Series has just one row, but we still need `.iloc[0]` to extract the value from it. – user6386471 Nov 20 '20 at 17:16
  • Oh I see, awesome! And since you said no worries, I will try one more: why did you need to change `df2['Continent_Name']` to `'Continent_Name'`? After all, I wanted to extract the string value from that column, not the string literal `'Continent_Name'`. I would have expected `'Continent_Name'` to return just the word `Continent_Name`. I promise this is my last question :D – Papayapap Nov 20 '20 at 17:16
  • Haha! So when you use the dataframe `.loc[r, c]` indexer, you usually specify the row index label (r) and the column label (c), and it returns the value at (r, c). Pandas also lets you pass a boolean mask for r, in which case it returns the rows where the condition is True (which is what we're doing here). You then still only need the column label, c, rather than the column's data itself. – user6386471 Nov 20 '20 at 17:22
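
The points from this comment thread can be checked step by step. The following is a small sketch using the df2 from the question; the variable names (`mask`, `matches`, `value`, and the two flags) are introduced here for illustration only.

```python
import pandas as pd

df2 = pd.DataFrame({
    'Country_Name': ['Algeria, Republic of', 'Andorra', 'Kingdom of Bahrain', 'Russia'],
    'Continent_Name': ['Africa', 'Europe', 'Asia', 'Europe'],
})

# Row selector: a boolean Series, True where the country name contains 'Bahrain'
mask = df2['Country_Name'].str.contains('Bahrain')

# Boolean mask for rows + column *label* for columns gives a Series
matches = df2.loc[mask, 'Continent_Name']

# .iloc[0] pulls the first (here: only) value out of that Series
value = matches.iloc[0]

# Passing the column's *data* instead of its label makes pandas look for
# columns named 'Africa', 'Europe', ... and raise the KeyError from the question
try:
    df2.loc[mask, df2['Continent_Name']]
    key_error_raised = False
except KeyError:
    key_error_raised = True

# With no matching row the Series is empty, so .iloc[0] raises IndexError
no_match = df2.loc[df2['Country_Name'].str.contains('Spain'), 'Continent_Name']
try:
    no_match.iloc[0]
    index_error_raised = False
except IndexError:
    index_error_raised = True

print(mask.tolist(), value, key_error_raised, index_error_raised)
```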