
I have a `country` column in which each row lists more than one country. I want to convert each country to its continent. In the past I have used country_converter for this, but when I try it here I get an error because there is more than one country per row.

How can I fix this?

!pip install country_converter --upgrade

import pandas as pd
import country_converter as coco
import pycountry_convert as pc

df = pd.DataFrame()
df['country']=['United States, Canada, England', 'United Kingdom, Spain, South Korea', 'Spain', 'France, Sweden']

# CONVERT COUNTRY TO ISO COUNTRY
cc = coco.CountryConverter()

# Create a list of country names for the dataframe
country = []
for name in df['country']:
    country.append(name)
    
# Converting country names to ISO 3    
iso_alpha = cc.convert(names = country, to='ISO3')

# CONVERT ISO COUNTRY TO CONTENENT
def country_to_continent(country_name):
    country_alpha2 = pc.country_name_to_country_alpha2(country_name)
    country_continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
    country_continent_name = pc.convert_continent_code_to_continent_name(country_continent_code)
    return country_continent_name

# converting to contenents
contenent=[]
for iso in iso_alpha:
    try:
        country_name = iso
        contenent.append(country_to_continent(country_name))
    except:
        contenent.append('other')

# add contenents to original dataframe
df['Contenent']=contenent
Rebecca James
  • Where is the error happening? In `iso_alpha = cc.convert(names = country, to='ISO3')` or afterwards? – Ignatius Reilly Jul 02 '22 at 15:45
  • Where does it fail? At "iso_alpha"? FYI: You have a typo in continent – Chris Jul 02 '22 at 15:48
  • yes at that line – Rebecca James Jul 02 '22 at 15:48
  • You should get a list of ISO names by doing `iso_alpha_list = [cc.convert(names=name, to='ISO3') for name in country]`. Or you can just iterate through the list with a for loop, the same way you generated the list "country" before. – Ignatius Reilly Jul 02 '22 at 15:49
  • BTW, in your example many names appear together as a single string ('United States, Canada, England' instead of 'United States', 'Canada', 'England'). It's going to generate bugs when testing it. – Ignatius Reilly Jul 02 '22 at 15:51
  • @Ignatius Reilly, that's how my data is in the full dataset. I guess I would have to split them up first then? – Rebecca James Jul 02 '22 at 15:54
  • Ok, I thought you had a list per row, not many countries in a single string. Then yes, you should split it, but you're going to have trouble with compound names like United Kingdom. – Ignatius Reilly Jul 02 '22 at 15:58
  • This one may help: https://stackoverflow.com/questions/67768095/how-to-extract-country-from-a-string-in-python – Ignatius Reilly Jul 02 '22 at 16:00
  • And this one: https://stackoverflow.com/questions/48607339/how-to-extract-countries-from-a-text – Ignatius Reilly Jul 02 '22 at 16:01

2 Answers


Assuming I understood you correctly, you want the result back in the DataFrame, so each row would list the continents corresponding to its countries.

If so, you'll need to go through the rows one at a time, split each string so that every country can be processed separately, and then join the results back into one string per row before putting them into the DataFrame.

A few things to note:

  • "England" isn't recognized as a country, so it will be labeled "other". If you use an IDE, the execution window will display a warning. I didn't try to fix this.
  • CountryConverter's convert returns a string rather than a list when it gets only one country, so the code has to check the return type.
  • I moved the "def" to the top, so the main code is at the bottom.

Here is the code that works for me:

import pandas as pd
import country_converter as coco
import pycountry_convert as pc

# CONVERT ISO COUNTRY TO CONTINENT
def country_to_continent(country_name):
    country_alpha2 = pc.country_name_to_country_alpha2(country_name)
    country_continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
    country_continent_name = pc.convert_continent_code_to_continent_name(country_continent_code)
    return country_continent_name


# ------ MAIN -------
df = pd.DataFrame()
df['country']=['United States, Canada, England', 'United Kingdom, Spain, South Korea', 'Spain', 'France, Sweden']

# CONVERT COUNTRY TO ISO COUNTRY
cc = coco.CountryConverter()

# Convert each row's comma-separated country names to continents
cont_list = []
for arow in df['country']:
    # Split the row into individual country names
    country = arow.split(", ")

    #print(f'org:{arow} split:{country}')
    # Converting country names to ISO 3
    iso_alpha = cc.convert(names=country, to='ISO3')
    #print(f'iso_alpha:{iso_alpha} type:{type(iso_alpha)}')

    # convert() returns a bare string for a single name; wrap it in a list
    if isinstance(iso_alpha, str):
        iso_alpha = [iso_alpha]

    # converting to continents
    continent = []
    for iso in iso_alpha:
        try:
            #print(f'   iso:{iso}')
            continent.append(country_to_continent(iso))
        except Exception:
            continent.append('other')

    # convert list back to string
    str_cont = ', '.join(continent)
    #print(f'str_cont:{str_cont}')
    cont_list.append(str_cont)

# add continents to original dataframe
df['Continent'] = cont_list
print(f"DF Continent: \n{df['Continent']}")
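
The same split, convert, join pattern can probably be packed into one small helper. The following is only a rough sketch under a few assumptions: `as_list`, `row_to_continents`, and the `ALIASES` table are hypothetical names that do not come from either library, mapping "England" to "United Kingdom" is just one illustrative way to handle unrecognized names, and `fake_convert` with its made-up lookup table stands in for the real conversion call so that the sketch runs without the packages installed.

```python
from typing import Callable, Iterable, Union

# Hypothetical alias table: names the converter does not recognise are
# rewritten first. "England" -> "United Kingdom" is an illustrative choice.
ALIASES = {"England": "United Kingdom"}


def as_list(result: Union[str, Iterable[str]]) -> list:
    """Wrap a bare string in a list so callers can always iterate."""
    return [result] if isinstance(result, str) else list(result)


def row_to_continents(row: str, convert: Callable[[list], Union[str, list]]) -> str:
    """Split one comma-separated cell, convert each name, and join the results."""
    names = [ALIASES.get(part.strip(), part.strip()) for part in row.split(",")]
    return ", ".join(as_list(convert(names)))


# Stand-in for the real conversion step (assumption: made-up lookup table,
# used only so this sketch runs on its own).
_DEMO = {"United States": "America", "Canada": "America",
         "United Kingdom": "Europe", "Spain": "Europe"}


def fake_convert(names: list):
    found = [_DEMO.get(n, "other") for n in names]
    # Mimic the behaviour noted above: one name in, bare string out.
    return found[0] if len(found) == 1 else found


print(row_to_continents("United States, Canada, England", fake_convert))
print(row_to_continents("Spain", fake_convert))
```

In real use, `convert` would be a thin wrapper around the conversion steps above, applied to each cell of `df['country']`.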

H3coder
  • Thanks @H3coder, I came to a solution myself but I think converting the array back to a list would be a good idea, I will use your solution to continue to refine mine – Rebecca James Jul 02 '22 at 18:40
  • Thanks for the accept :) Your code below is much cleaner now! I had suspected that there were possibly duplicate steps, but was just working quickly to get to a solution for you. Good job! – H3coder Jul 03 '22 at 09:42

With help from @Ignatius Reilly, I was able to figure this out.

I am still learning Python, so splitting the string first was easy for me to understand. Since all the countries were separated by commas, it worked without complication.

country_split=[]
for x in df['country']:
    country_split.append(x.split(','))
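
One detail worth noting: splitting on ',' alone leaves a leading space on every name after the first. That caused no problems for me here, but a stripped version may be safer. This is a small sketch that uses a made-up `rows` list in place of `df['country']`:

```python
# Made-up sample rows standing in for df['country'] (assumption for the demo).
rows = ['United States, Canada', 'Spain']

# Plain split(',') keeps the space that follows each comma.
raw = [row.split(',') for row in rows]

# Stripping each piece gives clean names.
clean = [[name.strip() for name in row.split(',')] for row in rows]

print(raw)
print(clean)
```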

Then I realized that I could change the `to` argument of cc.convert from 'ISO3' to 'Continent', which really simplified the code.

The output contained duplicate continents, for example [America, America], so I used .map(pd.unique) to remove the duplicate values.

The final code is:
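
A plain-Python alternative for the de-duplication step, as a rough sketch: `dict.fromkeys` keeps the first occurrence of each value in its original order. `unique_in_order` is a hypothetical helper name, not part of pandas or country_converter. It also wraps a bare string in a list, since convert can return a string instead of a list when it gets only one country (as noted in the other answer).

```python
def unique_in_order(values):
    """Drop repeats while keeping first-seen order; accept a bare string too."""
    if isinstance(values, str):
        values = [values]
    return list(dict.fromkeys(values))


print(unique_in_order(['America', 'America']))
print(unique_in_order(['Europe', 'Europe', 'Asia']))
print(unique_in_order('Europe'))
```

It could be passed to `.map` in the same way as `pd.unique`.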

!pip install country_converter --upgrade

import pandas as pd
import country_converter as coco

df = pd.DataFrame()
df['country']=['United States, Canada', 'United Kingdom, Spain, South Korea', 'Spain', 'France, Sweden']

# Create a list of country names from the dataframe
country_split=[]
for x in df['country']:
    country_split.append(x.split(','))

# Converting country names to continent
cc = coco.CountryConverter()
iso_alpha_list = [cc.convert(names=name, to='Continent') for name in country_split]

df['continent_split']= iso_alpha_list
df['continent']=df['continent_split'].map(pd.unique)
Rebecca James