0

I am trying to make a left join over two dataframe in python / pandas. I can't have it done :-( Here are the tests I have coded to achieve what I want :

print "nb of common indexes=%s"%(len(set(df1.index).union(set(df2.index))))
print "nb of distinct value on specific col to merge : df1 = ", df1[col_df1].value_counts().size
print "nb of distinct value on specific col to merge : df2 = ", df2[col_df2].value_counts().size
print "Expected size = df1 = ", df1[col_df1].value_counts().size
print "df1= ", df1.shape
print "df2= ", df2.shape
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='left')
print "new_df / left = ", new_df.shape
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='right')
print "new_df / right = ", new_df.shape
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='right', right_index=True)
print "new_df / right index = ", new_df.shape
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='right', left_index=True) 
print "new_df / left index = ", new_df.shape
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='right', left_index=True, right_index=True)
print "new_df / right @ left index = ", new_df.shape

And the result =

nb of common indexes=1147
nb of distinct value on specific col to merge : df1 =  848
nb of distinct value on specific col to merge : df2 =  1147
Expected size = df1 =  848
df1=  (9999, 53)
df2=  (1867, 19)
new_df / left =  (18582, 72)
new_df / right =  (18913, 72)
new_df / right index =  (18913, 72)
new_df / left index =  (18913, 72)

I can't find the right combinaison to have back only the 848 lines I have on my left df... Does anyone see the mistake ?

EDIT :

new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='left', right_index=True)
print "new_df / right index = ", new_df.shape
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='left', left_index=True)
print "new_df / left index = ", new_df.shape
new_df = pd.merge(df1, df2, left_on=col_df1, right_on=col_df2, how='left', left_index=True, right_index=True)
print "new_df / right @ left index = ", new_df.shape

GIVES:

new_df / right index =  (18582, 72)
new_df / left index =  (18582, 72)
new_df / right @ left index =  (18582, 72)

Still not the wanted values :-(

Romain Jouin
  • 4,448
  • 3
  • 49
  • 79
  • Why don't you reduce the dfs to a small enough size to post them in the question. What do they look like? What is the join column. Personally, I have no idea what's happening with your example. – Ami Tavory May 10 '15 at 16:35
  • Sample input and output would help immensely! – fixxxer May 10 '15 at 16:36
  • I added a one-line example of the df, if it can be useful... – Romain Jouin May 10 '15 at 16:52
  • Try to do a `df.head()` on your dataframe variable and paste the output here. The current structure doesn't help much. – fixxxer May 10 '15 at 17:07

1 Answers1

0

There is no bug on this code : according to the documentation (actually O'reilly 'python for data analysis', p179/180) a many-to-many left join make a cross-product over the lines.

(I have been programming to much in pig lastly, where the behaviour is not the same : only left-side-lines are kept. Or maybe I am mixed up :-( )

Romain Jouin
  • 4,448
  • 3
  • 49
  • 79
  • I have to say I really liked this post : http://stackoverflow.com/questions/28345724/pandas-merge-on-aggregated-columns – Romain Jouin May 10 '15 at 17:34