Joining dataframes in Python

Question

For the following dataframes in Python:

 Yref = pd.read_csv(rootDir + 'data/trailerClassificationData/C'+str(2)+'_withinShotAggr_'+withinShotAggr+'_btwshotAggr_'+btwShotAggr+'.csv',sep=',')
 Y = pd.read_csv(rootDir + 'data/trailerClassificationData/C'+str(3)+'_withinShotAggr_'+withinShotAggr+'_btwshotAggr_'+btwShotAggr+'.csv',sep=',')

where Y and Yref are some target classification outputs:

Yref

   movieId Action Comedy Drama Horror
0  93797     1      0     1      0
1  25899     0      1     0      0
2  5673      0      1     1      0
3  86308     0      1     0      0
4  3577      0      0     1      0
5  3575      0      0     1      0
...
7100 rows × 5 columns

and similarly for Y

Y

   movieId Action Comedy Drama Horror
0  93797     1      0     1      0
1  1222      0      0     1      0
2  5673      0      1     1      0
3  86308     0      1     0      0
4  3577      0      0     1      0
5  3575      0      0     1      0

7136 rows × 5 columns

As can be seen, the two outputs have a different number of rows. The first question is therefore how to join the two dataframes with on='movieId' and how='inner'.

Yjoin = Yref.join(Y,how='inner',on='movieId')

This gave me the error `columns overlap but no suffix specified`. I managed to solve this first problem using:

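  # inner merge on the key; overlapping columns get the suffixes _x (from Yref) and _y (from Y)
  Yjoin = Yref.merge(Y,on='movieId',how='inner')
  # keep movieId plus the four _x columns (.ix is no longer available in recent pandas, so use .iloc)
  Yjoin = Yjoin.iloc[:,0:5]
  Yjoin.rename(columns={'Action_x':'Action','Comedy_x':'Comedy','Drama_x':'Drama','Horror_x':'Horror'}, inplace=True)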
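As far as I understand it, the error appears because `join` with `on=` matches the 'movieId' column of Yref against the index of Y, and every column name then overlaps. A minimal sketch of one way to avoid the suffixes altogether is to merge against only the key column of Y. The small frames below are made-up stand-ins for the real CSV data, with fewer columns and rows:

```python
import pandas as pd

# Toy stand-ins for the real data (values are illustrative only).
Yref = pd.DataFrame({"movieId": [93797, 25899, 5673],
                     "Action": [1, 0, 0],
                     "Comedy": [0, 1, 1]})
Y = pd.DataFrame({"movieId": [93797, 1222, 5673],
                  "Action": [1, 0, 0],
                  "Comedy": [0, 0, 1]})

# Merge against the key column only, so no label columns overlap
# and no _x/_y suffixes are created.
Yjoin = Yref.merge(Y[["movieId"]], on="movieId", how="inner")
print(Yjoin)
```

This keeps the label columns of Yref unchanged, so no slicing or renaming is needed afterwards. It assumes 'movieId' is unique in Y; duplicated keys would repeat rows in the result.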

Once that is done, there is also X, a dataframe whose rows correspond to those of Y but which does not contain the key 'movieId':

   test1     test2     test3          test4         test5
0  0.038039  0.212623  4.052835e-02   5.210721e-02  0.004591
1  0.054539  0.257145  0.000000e+00   0.000000e+00  0.115421
2  0.002842  0.209085  1.114923e-02   3.844100e-02  0.024544
3  0.136707  0.377181  0.000000e+00   0.000000e+00  0.055199
....
7136 rows × 5 columns

I need to remove from X the same rows that were dropped when building Yjoin, so that X has shape 7100 × 5. In the end, Y and X should have the same number of rows (7100).
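A possible sketch of this second step, assuming row i of X describes the same movie as row i of Y: build a boolean mask from Y's 'movieId' and apply it to both frames by position. The toy frames and the names `keep`, `Y_kept` and `X_kept` are hypothetical, for illustration only:

```python
import pandas as pd

# Toy stand-ins; the real frames come from the CSV files.
Yref = pd.DataFrame({"movieId": [93797, 25899, 5673, 86308]})
Y = pd.DataFrame({"movieId": [93797, 1222, 5673, 86308, 3577]})
# Assumption: X is row-aligned with Y (same length, same order).
X = pd.DataFrame({"test1": [0.1, 0.2, 0.3, 0.4, 0.5],
                  "test2": [1.0, 2.0, 3.0, 4.0, 5.0]})

# True where the movieId of a row in Y also appears in Yref.
keep = Y["movieId"].isin(Yref["movieId"]).to_numpy()

# Apply the same positional mask to Y and X, then renumber the rows.
Y_kept = Y.loc[keep].reset_index(drop=True)
X_kept = X.loc[keep].reset_index(drop=True)

print(Y_kept)
print(X_kept)
```

The filtered frames keep the row order of Y, which may differ from the order of a merged Yjoin. The resulting row count equals the number of Y keys found in Yref, so it only reaches 7100 if every 'movieId' in Yref also occurs in Y.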

Thanks for your comments.

Sorry are you after merge? `res = Yref.merge(Y,how='inner',on='movieId')` — EdChum, Jan 18 '17 at 10:22
Thanks. Very similar to this, but I do not need to merge the two tables column-wise, as the columns are the same. I just want to remove the rows from `Y` that do not exist in `Yref`, using `movieId` as the key. — FlytoScience, Jan 18 '17 at 10:28
you mean like this: http://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe? — EdChum, Jan 18 '17 at 10:31
You should say what error it gives you. "an" is not very helpful. — Mikk, Jan 18 '17 at 10:52
Thanks for your comments, I updated my question. I solved the first problem, now I need to answer the second question. — FlytoScience, Jan 18 '17 at 11:10
