I am trying to make a left join over two dataframe in python / pandas. I can't have it done :-( Here are the tests I have coded to achieve what I want :
print "nb of common indexes=%s"%(len(set(df1.index).union(set(df2.index))))
print "nb of distinct value on specific col to merge : df1 = ", df1[col_df1].value_counts().size
print "nb of distinct value on specific col to merge : df2 = ", df2[col_df2].value_counts().size
print "Expected size = df1 = ", df1[col_df1].value_counts().size
print "df1= ", df1.shape
print "df2= ", df2.shape
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='left')
print "new_df / left = ", new_df.shape
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='right')
print "new_df / right = ", new_df.shape
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='right', right_index=True)
print "new_df / right index = ", new_df.shape
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='right', left_index=True)
print "new_df / left index = ", new_df.shape
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='right', left_index=True, right_index=True)
print "new_df / right @ left index = ", new_df.shape
And the result =
nb of common indexes=1147
nb of distinct value on specific col to merge : df1 = 848
nb of distinct value on specific col to merge : df2 = 1147
Expected size = df1 = 848
df1= (9999, 53)
df2= (1867, 19)
new_df / left = (18582, 72)
new_df / right = (18913, 72)
new_df / right index = (18913, 72)
new_df / left index = (18913, 72)
I can't find the right combinaison to have back only the 848 lines I have on my left df... Does anyone see the mistake ?
EDIT :
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='left', right_index=True)
print "new_df / right index = ", new_df.shape
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='left', left_index=True)
print "new_df / left index = ", new_df.shape
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='left', left_index=True, right_index=True)
print "new_df / right @ left index = ", new_df.shape
GIVES:
new_df / right index = (18582, 72)
new_df / left index = (18582, 72)
new_df / right @ left index = (18582, 72)
Still not the wanted values :-(