Pandas generates a column based by matching the dataframe columns to multiple other columns

Question

I am trying to find a neat solution to my problem. I have three tables:

Code DF
Code1 Code2 Code3 Code4 Code5
Eur xxx xxx xxx xxx
xxx xxx xxx ESP xxx
ASI xxx xxx xxx xxx
xxx BRA xxx xxx xxx
xxx AUS xxx xxx xxx
xxx xxx NOR xxx xxx
xxx xxx xxx PRT xxx
xxx xxx xxx xxx SGP


Country1 DF
Country-Code Region
Eur Europe
ASI Asia
BRA America
AUS Asia
NOR Europe

Country2 DF
Country Code    Region
ESP Europe
PRT Europe
SGP Asia
ASI Asia

So what I want to do is, Create a Fifth Column which is Region.First I want to check the values in Code5 and Code4 respectively and if one of the codes matches Country2 dataframe then put its coresponding Region value in the Region column. If no matching code is found in Code5, go to Code4, if nothing there Code3 and etc. Code5 abbreviation and Code4 needs to look in Country2 dataframe and Code3,Code2 and Code1 need to look in Country1 dataframe. Just to clarify "xxx" can either be some other 3 letter abbreviation or empty space. There could also be Code and Regions that are similar between Country1 DF and Country2 DF, but there are some values that Code4 and Code5 should not match with Country1 df, that is why there are two different dataframes for matching. The case here is EUR, in Code1,Code2,Code3 is region Europe, but in Code4,Code5 it is currency and I do not want it to map to Europe if it consists in one of those two columns. The end case scenario needs to be something like this:

`Code1  Code2   Code3   Code4   Code5   Region
Eur xxx xxx xxx xxx Europe
xxx xxx xxx ESP xxx Europe
ASI xxx xxx xxx xxx Asia
xxx BRA xxx xxx xxx America
xxx AUS xxx xxx xxx Asia
xxx xxx NOR xxx xxx Europe
xxx xxx xxx PRT xxx Europe
xxx xxx xxx xxx SGP Europe
`

I feel like this might be an [XY problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). How did you obtain this data? — Sebastian Mendez, Nov 28 '17 at 04:31

score 0 · Answer 1 · answered Nov 28 '17 at 04:46

You might do this with a list comprehension:

def determine_region(df_row):
    # if else chain to make a decision for each row
    # or maybe you could use python builtin set to make it 
    # more semantic

# capture each item into a list with a comprehension
x = [ determine_region(x) for x in CodeDF ]
# append the data into a new column named region
CodeDF.loc[:,'Region'] = pd.Series(x)

Other Resources

Appending Column to Pandas DF

List Comprehensions

Sets and Operations with Sets

You could use `df.apply` instead. – Sebastian Mendez Nov 28 '17 at 04:54 — Sebastian Mendez, Nov 28 '17 at 04:54

score 0 · Accepted Answer · answered Nov 28 '17 at 04:51

A better method to store your country code mappings would be in a dictionary. I'm going to assume that country_dict1, country_dict2 are the mappings of code:region for each dataframe, respectively:

def determine_region(row):
    for item in row[:-3:-1]:
        if item in country_dict1:
            return country_dict1.get(item)
    for item2 in row[-3::-1]:
        if item2 in country_dict2:
            return country_dict2.get(item2)
    return pd.np.nan

df['Region'] = df.apply(determine_region, axis=1)

Hello, it worked perfectly as I needed it. Thank you. – soky Nov 28 '17 at 10:30 — soky, Nov 28 '17 at 10:30

Pandas generates a column based by matching the dataframe columns to multiple other columns

2 Answers2