Pandas merge produces duplicate entries

Question

Disclaimer: I'm not a new developer - but quite new to pandas/python, so this could be simple.

Situation: I want to "merge" 2 pandas dataframe based on a "key". For example, I have 2 datasets:

confirmed_cases ("Country/Region","Lat","Long") - 279 rows
death_cases("Country/Region","8/8/21") - 279 rows

I want to merge the above 2 dataframe into ONE dataframe, with 4 columns: "Country/Region", "Lat", "Long", "8/8/21"

However, the merge result always produce A LOT more rows than expected (1989 rows instead of 279 - and I have NO idea why it produces that number)

Perhaps the code below explains it better:

import pandas as pd
import numpy as np

# TEST CODE
confirmed_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
deaths_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')

confirmed_cases = confirmed_df[["Country/Region","Lat","Long"]]
death_cases = deaths_df[["Country/Region","8/8/21"]]

print("confirmed_cases - shape: {}", confirmed_cases.shape) # shape = (279,3)
print("death_cases - shape: {}", death_cases.shape) # shape = (279,2)

print("Any country IN death_cases & IN confirmed_cases")
print(death_cases[death_cases["Country/Region"].isin(confirmed_cases["Country/Region"])]["Country/Region"].shape) # 279

all_info = pd.merge(confirmed_cases, death_cases, on="Country/Region") # Merge 2 dataframe
print("all_info - shape: {}", all_info.shape) # shape = (1989,4) ??? Why 1989?

print(all_info.head())

As you can see, all 279 records of the first dataframe has the SAME "Country/Region" value with the 2nd dataframe. So no data mismatch here.

I also did try to match the type (convert both "Country/Region" column to str), but the result is the same.

What did I do wrong here?

You have duplicates in the key column you merge on in both frames. . . `death_cases[death_cases['Country/Region'].duplicated()]` and `confirmed_cases[confirmed_cases['Country/Region'].duplicated()]` — It_is_Chris, Aug 18 '21 at 13:58
Hi @It_is_Chris, thanks, indeed I have quite a few duplicates in the data - I haven't thought about that possibility. Thanks for the help! — Hoàng Long, Aug 18 '21 at 14:05
About the "Pandas Merging 101"... I don't think so. I did read through similar post - they show us how to do the merge - but not mention about this pitfall which produces so many duplicates... So I don't think it will help others having the same issues. — Hoàng Long, Aug 18 '21 at 14:06
this is a classic newbie SQL problem, you have duplicate keys on one of your joins causing a product. of both dataframes. — Umar.H, Aug 18 '21 at 14:07
Well... to be fair, in SQL we can very quickly see if a column is unique or PK in table structure. For dataframe, no such luck. For example in the above example, I have assumed that "Country/Region" is the key, as we have quite a bunch of N/A value in "Province". Turn out it's a bad assumption... but still. — Hoàng Long, Aug 18 '21 at 14:51
@HoàngLong I agree but it's now always the case, large DWH like Azure Synapse you don't have primarmy keys. — Umar.H, Aug 18 '21 at 15:09

score 1 · Answer 1 · answered Aug 18 '21 at 14:00

1

Your problem is that your data frame has duplicated values in the column 'Country/Region'. Therefore it's not a 1 to 1 merge.

confirmed_df['Country/Region'].nunique()

Returns: 195

answered Aug 18 '21 at 14:00

sergiomahi

964
2
8
21

Thanks a lot, I can't believe I miss something so simple! – Hoàng Long Aug 18 '21 at 14:02

score 0 · Answer 2 · answered Aug 18 '21 at 14:14

You need to keep the lat and long in deaths_df for the merge

# TEST CODE
confirmed_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
deaths_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')

confirmed_cases = confirmed_df[["Country/Region","Lat","Long"]]
death_cases = deaths_df[["Country/Region", "Lat", "Long", "8/8/21"]]

m = confirmed_cases.merge(death_cases, on=["Country/Region","Lat","Long"])

but there is an issue with Canada. For some reason there is more than one record on the same day for the same lat and long. You probably want to drop that

m[m.duplicated()]

   Country/Region  Lat  Long  8/8/21
43         Canada  0.0   0.0       1
44         Canada  0.0   0.0       0

Pandas merge produces duplicate entries

2 Answers2