Disclaimer: I'm not a new developer - but quite new to pandas/python, so this could be simple.
Situation: I want to "merge" 2 pandas dataframe based on a "key". For example, I have 2 datasets:
- confirmed_cases ("Country/Region","Lat","Long") - 279 rows
- death_cases("Country/Region","8/8/21") - 279 rows
I want to merge the above 2 dataframe into ONE dataframe, with 4 columns: "Country/Region", "Lat", "Long", "8/8/21"
However, the merge result always produce A LOT more rows than expected (1989 rows instead of 279 - and I have NO idea why it produces that number)
Perhaps the code below explains it better:
import pandas as pd
import numpy as np
# TEST CODE
confirmed_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
deaths_df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
confirmed_cases = confirmed_df[["Country/Region","Lat","Long"]]
death_cases = deaths_df[["Country/Region","8/8/21"]]
print("confirmed_cases - shape: {}", confirmed_cases.shape) # shape = (279,3)
print("death_cases - shape: {}", death_cases.shape) # shape = (279,2)
print("Any country IN death_cases & IN confirmed_cases")
print(death_cases[death_cases["Country/Region"].isin(confirmed_cases["Country/Region"])]["Country/Region"].shape) # 279
all_info = pd.merge(confirmed_cases, death_cases, on="Country/Region") # Merge 2 dataframe
print("all_info - shape: {}", all_info.shape) # shape = (1989,4) ??? Why 1989?
print(all_info.head())
As you can see, all 279 records of the first dataframe has the SAME "Country/Region" value with the 2nd dataframe. So no data mismatch here.
I also did try to match the type (convert both "Country/Region" column to str
), but the result is the same.
What did I do wrong here?