To make this as clear as possible, I started with a simple example. I created two small dummy dataframes:
import pandas as pd

dummy_data1 = {
    'id': ['1', '2', '3', '4', '5'],
    'Feature1': ['A', 'C', 'E', 'G', 'I'],
    'Feature2': ['B', 'D', 'F', 'H', 'J']}
df1 = pd.DataFrame(dummy_data1, columns=['id', 'Feature1', 'Feature2'])

dummy_data2 = {
    'id': ['1', '2', '6', '7', '8'],
    'Feature3': ['K', 'M', 'O', 'Q', 'S'],
    'Feature4': ['L', 'N', 'P', 'R', 'T']}
df2 = pd.DataFrame(dummy_data2, columns=['id', 'Feature3', 'Feature4'])
If I then apply
df_merge = pd.merge(df1, df2, on='id', how='outer')
or
df_merge = df1.merge(df2, how='left', left_on='id', right_on='id')
I get the desired output.
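For reference, the row counts on this toy example can be checked with a short sketch. It rebuilds the two frames from above; the names outer and left are only for illustration. Note that the two calls do not return the same number of rows, because the outer join keeps ids from both sides while the left join keeps only the ids in df1:

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': ['1', '2', '3', '4', '5'],
    'Feature1': ['A', 'C', 'E', 'G', 'I'],
    'Feature2': ['B', 'D', 'F', 'H', 'J']})
df2 = pd.DataFrame({
    'id': ['1', '2', '6', '7', '8'],
    'Feature3': ['K', 'M', 'O', 'Q', 'S'],
    'Feature4': ['L', 'N', 'P', 'R', 'T']})

# Outer join: union of ids from both frames (1..8).
outer = pd.merge(df1, df2, on='id', how='outer')
# Left join: only the ids present in df1 (1..5); unmatched rows get NaN.
left = df1.merge(df2, how='left', on='id')

print(len(outer))  # 8
print(len(left))   # 5
```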
Now I am trying to apply the same technique to two large datasets that have the same number of rows. All I want to do is join the columns together into one large dataframe. Each dataframe has 512573 rows.
But when I apply
df_merge = orig_data_updated.merge(demographic_data1, how='left', left_on='Location+Type', right_on='Location+Type')
the length becomes 3596301, which I did not expect, since both inputs have 512573 rows.
My question is simple: how do I do a left join on two dataframes so that the number of rows stays the same and the columns are just joined together?
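One possible explanation, which I have not confirmed on the real data, is that the values in 'Location+Type' are not unique in the right-hand dataframe. A left join pairs each left row with every right row that has the same key, so repeated keys multiply the row count. The sketch below reproduces this on tiny made-up frames (left and right are hypothetical stand-ins for orig_data_updated and demographic_data1) and shows a few ways to check for and avoid it:

```python
import pandas as pd

key = 'Location+Type'

# Hypothetical stand-ins for the two large dataframes; key 'a' is repeated.
left = pd.DataFrame({key: ['a', 'a', 'b'], 'x': [1, 2, 3]})
right = pd.DataFrame({key: ['a', 'a', 'b'], 'y': [10, 20, 30]})

# Each left 'a' row matches both right 'a' rows: 2*2 + 1*1 = 5 rows.
merged = left.merge(right, how='left', on=key)
print(len(merged))  # 5

# Count right-hand rows whose key already appeared earlier.
n_dup = right[key].duplicated().sum()
print(n_dup)  # 1

# validate makes pandas raise instead of silently multiplying rows.
try:
    left.merge(right, how='left', on=key, validate='many_to_one')
except pd.errors.MergeError as exc:
    print('right-hand keys are not unique:', exc)

# Option 1: keep one right-hand row per key, then merge.
right_unique = right.drop_duplicates(subset=key)
same_len = left.merge(right_unique, how='left', on=key)
print(len(same_len))  # 3

# Option 2: if the rows are already aligned by position, skip the join
# and place the columns side by side.
side_by_side = pd.concat(
    [left.reset_index(drop=True),
     right.drop(columns=key).reset_index(drop=True)],
    axis=1)
print(len(side_by_side))  # 3
```

Which option fits depends on the data: drop_duplicates discards right-hand rows, so it only makes sense if the repeated keys carry redundant information, and the concat approach is only valid if row i of one frame really corresponds to row i of the other.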