1

I want to create a column in a dataframe that is conditionally filled with values. Basically my dataframe loks like this

  Origin     X
0 Guatemala  x
1 China      x
2 Kenya      x
3 Venezuela  x
4 Bangladesh x

What I want to do now is create an additional column 'Continent', which adds the continent dependent on the country. My result would look like this:

 Origin      X  Continent
0 Guatemala  x  South america
1 China      x  Asia
2 Kenya      x  Africa
3 Venezuela  x  South america
4 Bangladesh x  Asia

I have tried the following codes to accieve what i want:

def GetContinents(x):
    if x['Origin']== 'Thailand' or 'Indonesia' or 'China' or 'Japan' or 'Bangladesh':
        return 'Asia'
    elif x['Origin']== 'Boliva' or 'Guatemala' or 'Venezuela' or 'Mexico' or 'Argentinia':
        return 'South America'
    elif x['Origin']== 'Guinea Bissau' or 'Egypt' or 'Zaire' or 'Kenya':
        return 'Africa'
    else:
        return 'unknown'

df['Continent']= df.apply(GetContinents, axis=1)

This one fills all the columns in 'continent' with 'Asia' mysteriously.

df['Continent'] = np.where(df['Origin'] == 'Bangladesh', 'Asia', 'unknown')

This one works fine in terms that it fills 'Asia' into the right column and unknown into all others, but when I try to make something like df['Continent'] = np.where(df['Origin'] == 'Bangladesh' or 'China', 'Asia', 'unknown') I get an error.

So basically my question is: how can I fullfill my if condition with different values?

sequence_hard
  • 5,115
  • 10
  • 30
  • 50

1 Answers1

1

You can construct the lists for each continent and apply a func:

In [35]:
asia = ['Thailand','Indonesia','China','Japan','Bangladesh']
south_america = ['Boliva' , 'Guatemala' , 'Venezuela' , 'Mexico' , 'Argentinia']
africa = [ 'Guinea Bissau' , 'Egypt' , 'Zaire' , 'Kenya']
def find_continent(x):
    if x in asia:
        return 'Asia'
    elif x in south_america:
        return 'South America'
    elif x in africa:
        return 'Africa'
    else:
        return 'Unknown'
df['Continent'] = df['Origin'].apply(find_continent)
df

Out[35]:
       Origin  X      Continent
0   Guatemala  x  South America
1       China  x           Asia
2       Kenya  x         Africa
3   Venezuela  x  South America
4  Bangladesh  x           Asia

Or if you have a much larger df then you can just make successive calls using isin and mask the rows using loc:

In [38]:
df.loc[df['Origin'].isin(asia),'Continent'] = 'Asia'
df.loc[df['Origin'].isin(south_america),'Continent'] = 'South America'
df.loc[df['Origin'].isin(africa),'Continent'] = 'Africa'
df['Continent'] = df['Continent'].fillna('Unknown')
df

Out[38]:
       Origin  X      Continent
0   Guatemala  x  South America
1       China  x           Asia
2       Kenya  x         Africa
3   Venezuela  x  South America
4  Bangladesh  x           Asia

As to why your attempts didn't work:

if x['Origin']== 'Thailand' or 'Indonesia' or 'China' or 'Japan' or 'Bangladesh'

This returns True because or 'Indonesia' is always True so all rows get set to Asia.

You should change it to like this:

if x['Origin'] in ('Thailand' , 'Indonesia' , 'China' , 'Japan' , 'Bangladesh'):

See related: How do I test one variable against multiple values?

Using np.where would be fine but you're not masking the rows so you continuously overwrite the rows so only the last op persists.

Community
  • 1
  • 1
EdChum
  • 376,765
  • 198
  • 813
  • 562