1

So I know that you can't use if statements on a pandas dataframe according to this post or you will get this error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). So how do you apply a function with multiple conditions?

I have a dataframe that was exported CRM data and contains a countries column that I need to convert to 2 letter country codes (United States to US and so on).

Here is a list of the unique values contained in the column of countries:

['United States', 'Canada', 'Australia', 'United Kingdom', 'US',
   'Germany', 'New Zealand', 'Netherlands', 'Mexico', 'France',
   'Ireland', 'Dominican Republic', 'Puerto Rico', 'Taiwan', 'USA',
   '1', 'united States', 'United Staes', 'United State', 'usa',
   'United Sates', 'United Stated', 'usaa', 'Unite States', 'nv',
   'canada', 'Pakistan']

My solution was to try something like this:

def country_codes(country):
    if country.str.contains(r'(United Kingdom)'):
        return 'GB'
    elif country.str.contains(r'(Canada|canada)'):
        return 'CA'
    elif country.str.contains(r'(Australia)'):
        return 'AU'
    elif country.str.contains(r'(United|US|USA|State|usa)'):
        return 'US'
    elif country.str.contains(r'(Germany)'):
        return 'DE'
    elif country.str.contains(r'(New Zealand)'):
        return 'NZ'
    elif country.str.contains(r'(Netherlands)'):
        return 'NL'
    elif country.str.contains(r'(Mexico)'):
        return 'MX'
    elif country.str.contains(r'(France)'):
        return 'FR'
    elif country.str.contains(r'(Ireland)'):
        return 'IE'
    elif country.str.contains(r'(Dominican)'):
        return 'DO'
    elif country.str.contains(r'(Puerto)'):
        return 'PR'
    elif country.str.contains(r'(Taiwan)'):
        return 'TW'
    else:
        return country 

but upon trying df.apply(country_codes) I am getting the same ValueError. If there's an easier way to do this without regex matching, I'm open to that as well.

Brian
  • 2,163
  • 1
  • 14
  • 26
  • 1
    I would just create a dictionary `{'United Kingdom': 'GB', ...}` and then just create a new column by mapping the country column. – ALollz Apr 19 '18 at 19:49

1 Answers1

1

One efficient way is to create a dictionary and then iterate the dictionary while updating the dataframe:

d = {'United Kingdom': 'GB', 'Canada|canada': 'CA',
     'Australia': 'AU', 'Untied|US|State|USA': 'US'}

df = pd.DataFrame({'Country': ['US', 'USA', 'United Kingdom', 'canada', 'Australia']})

for k, v in d.items():
    df.loc[df['Country'].str.contains(k), 'Code'] = v

print(df)

#           Country Code
# 0              US   US
# 1             USA   US
# 2  United Kingdom   GB
# 3          canada   CA
# 4       Australia   AU
jpp
  • 159,742
  • 34
  • 281
  • 339
  • Yes! That worked perfectly and makes a ton of sense. Thanks so much. – Trevor Theodore Apr 23 '18 at 17:28
  • Question though, does using `.str.contains(k)` essentially turn the key values of `d` into a regular expression? I'm confused as to how it would work with the multivariates and pipe characters otherwise. – Trevor Theodore Apr 23 '18 at 17:47
  • The function is pandas specific. You can turn off regex by passing regex Boolean argument set to False. But, yes, regex is enabled by default. – jpp Apr 23 '18 at 17:55