1

I have two datasets that look like this:

          name  Longitude   Latitude      continent
0        Aruba -69.982677  12.520880  North America
1  Afghanistan  66.004734  33.835231           Asia
2       Angola  17.537368 -12.293361         Africa
3     Anguilla -63.064989  18.223959  North America
4      Albania  20.049834  41.142450         Europe

And another dataset looks like this:

          COUNTRY  GDP (BILLIONS) CODE
0     Afghanistan           21.71  AFG
1         Albania           13.40  ALB
2         Algeria          227.80  DZA
3  American Samoa            0.75  ASM
4         Andorra            4.80  AND

Here, columns name and COUNTRY contains the country names but not in the same order.

How to combine the second dataframe into first one and add the CODE columns to the first dataframe.

Required output:

          name  Longitude   Latitude      continent   CODE
0        Aruba -69.982677  12.520880  North America   NaN
1  Afghanistan  66.004734  33.835231           Asia   AFG
2       Angola  17.537368 -12.293361         Africa   NaN
3     Anguilla -63.064989  18.223959  North America   NaN
4      Albania  20.049834  41.142450         Europe   ALB

Attempt:

import numpy as np
import pandas as pd

df = pd.DataFrame({'name' : ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
          'Longitude' : [-69.982677, 66.004734, 17.537368, -63.064989, 20.049834],
          'Latitude' : [12.520880, 33.835231, '-12.293361', 18.223959, 41.142450],
          'continent' : ['North America','Asia','Africa','North America','Europe'] })
print(df)

df2 = pd.DataFrame({'COUNTRY' :  ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
          'GDP (BILLIONS)' : [21.71, 13.40, 227.80, 0.75, 4.80],
          'CODE' : ['AFG', 'ALB', 'DZA', 'ASM', 'AND']})
print(df2)


pd.merge(left=df, right=df2,left_on='name',right_on='COUNTRY')
# but this fails
BhishanPoudel
  • 15,974
  • 21
  • 108
  • 169

3 Answers3

2

By default, pd.merge uses how='inner', which uses the intersection of keys across your two dataframes. Here, you need how='left' to use keys only from the left dataframe:

res = pd.merge(df, df2, how='left', left_on='name', right_on='COUNTRY')
jpp
  • 159,742
  • 34
  • 281
  • 339
1

The merge performs an 'inner' merge or join by default, only keeping records that have a match on both the left and the right. You want an 'outer' join, keeping all records (there is also 'left' or 'right').

Example:

import pandas as pd

df1 = pd.DataFrame({
    'name': ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
    'Longitude': [-69.982677, 66.004734, 17.537368, -63.064989, 20.049834],
    'Latitude': [12.520880, 33.835231, '-12.293361', 18.223959, 41.142450],
    'continent': ['North America', 'Asia', 'Africa', 'North America', 'Europe']
})
print(df1)

df2 = pd.DataFrame({
    'COUNTRY': ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
    'GDP (BILLIONS)': [21.71, 13.40, 227.80, 0.75, 4.80],
    'CODE': ['AFG', 'ALB', 'DZA', 'ASM', 'AND']
})
print(df2)

# merge, using 'outer' to avoid losing records from either left or right
df3 = pd.merge(left=df1, right=df2, left_on='name', right_on='COUNTRY', how='outer')
# combining the columns used to match
df3['name'] = df3.apply(lambda row: row['name'] if not pd.isnull(row['name']) else row['COUNTRY'], axis=1)
# dropping the now spare column
df3 = df3.drop('COUNTRY', axis=1)
print(df3)
Grismar
  • 27,561
  • 4
  • 31
  • 54
1

Pandas have pd.merge [https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html] function which uses inner join by default. Inner join basically takes only those values that are present in both the keys specified in either on or on left_on and right_on if the keys to merge on in both the dataframes are different.

Since, you require the CODE value to be added, following line of code could be used:

pd.merge(left=df, right=df2[['COUNTRY', 'CODE']], left_on='name', right_on='COUNTRY', how='left')

This gives the following output:

          name  Longitude   Latitude      continent      COUNTRY CODE
0        Aruba -69.982677  12.520880  North America          NaN  NaN
1  Afghanistan  66.004734  33.835231           Asia  Afghanistan  AFG
2       Angola  17.537368 -12.293361         Africa          NaN  NaN
3     Anguilla -63.064989  18.223959  North America          NaN  NaN
4      Albania  20.049834  41.142450         Europe      Albania  ALB

Following also gives the same result:

new_df = pd.merge(left=df1[['COUNTRY', 'CODE']], right=df, left_on='COUNTRY', right_on='name', how='right')

       COUNTRY CODE         name  Longitude   Latitude      continent
0  Afghanistan  AFG  Afghanistan  66.004734  33.835231           Asia
1      Albania  ALB      Albania  20.049834  41.142450         Europe
2          NaN  NaN        Aruba -69.982677  12.520880  North America
3          NaN  NaN       Angola  17.537368 -12.293361         Africa
4          NaN  NaN     Anguilla -63.064989  18.223959  North America
thelogicalkoan
  • 620
  • 1
  • 5
  • 13