For the following dataframes in Python:
Yref = pd.read_csv(rootDir + 'data/trailerClassificationData/C'+str(2)+'_withinShotAggr_'+withinShotAggr+'_btwshotAggr_'+btwShotAggr+'.csv',sep=',')
Y = pd.read_csv(rootDir + 'data/trailerClassificationData/C'+str(3)+'_withinShotAggr_'+withinShotAggr+'_btwshotAggr_'+btwShotAggr+'.csv',sep=',')
where Y and Yref are target classification outputs:
Yref
   movieId  Action  Comedy  Drama  Horror
0    93797       1       0      1       0
1    25899       0       1      0       0
2     5673       0       1      1       0
3    86308       0       1      0       0
4     3577       0       0      1       0
5     3575       0       0      1       0
...
7100 rows × 5 columns
and similarly for Y
Y
   movieId  Action  Comedy  Drama  Horror
0    93797       1       0      1       0
1     1222       0       0      1       0
2     5673       0       1      1       0
3    86308       0       1      0       0
4     3577       0       0      1       0
5     3575       0       0      1       0
7136 rows × 5 columns
As can be seen, the two outputs have different numbers of rows. Therefore, the first question is: how do I join the two dataframes with on='movieId' and how='inner'?
Yjoin = Yref.join(Y,how='inner',on='movieId')
gave me this error: columns overlap but no suffix specified. I managed to solve the first problem using:
Yjoin = Yref.merge(Y, on='movieId', how='inner')
Yjoin = Yjoin.iloc[:, 0:5]  # keep movieId and the four label columns that came from Yref
Yjoin.rename(columns={'Action_x':'Action','Comedy_x':'Comedy','Drama_x':'Drama','Horror_x':'Horror'}, inplace=True)
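For reference, here is a minimal sketch of a variant that avoids the suffixes and the rename altogether by merging against only the key column of Y. The toy frames below are made up for illustration (only a few rows taken from the tables above), and it assumes movieId is unique within Y; with duplicate IDs the merge would repeat rows.

```python
import pandas as pd

# Toy stand-ins for Yref and Y (illustrative values, not the real data).
Yref = pd.DataFrame({
    'movieId': [93797, 25899, 5673, 86308],
    'Action':  [1, 0, 0, 0],
    'Comedy':  [0, 1, 1, 1],
    'Drama':   [1, 0, 1, 0],
    'Horror':  [0, 0, 0, 0],
})
Y = pd.DataFrame({
    'movieId': [93797, 1222, 5673, 86308],
    'Action':  [1, 0, 0, 0],
    'Comedy':  [0, 0, 1, 1],
    'Drama':   [1, 1, 1, 0],
    'Horror':  [0, 0, 0, 0],
})

# Merging against only the key column of Y means no label columns overlap,
# so no _x/_y suffixes are created and nothing has to be renamed.
Yjoin = Yref.merge(Y[['movieId']], on='movieId', how='inner')
print(Yjoin)
```

The result keeps the column names of Yref and only the rows whose movieId appears in both frames.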
Once done, X is a dataframe similar to Y, with the same number of rows but without the key 'movieId':
      test1     test2         test3         test4     test5
0  0.038039  0.212623  4.052835e-02  5.210721e-02  0.004591
1  0.054539  0.257145  0.000000e+00  0.000000e+00  0.115421
2  0.002842  0.209085  1.114923e-02  3.844100e-02  0.024544
3  0.136707  0.377181  0.000000e+00  0.000000e+00  0.055199
....
7136 rows × 5 columns
I need to remove the rows that were dropped from Yjoin from X as well, so that X has the same shape (7100 × 5). In the end, the joined Y and X should have the same number of rows, 7100.
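To make the goal concrete, here is a rough sketch of what I am after. It assumes that row i of X corresponds to row i of Y (X has no movieId, so position is the only link), and it uses made-up toy frames rather than the real files. The names keep, X_kept and Y_kept are just placeholders.

```python
import pandas as pd

# Toy stand-ins (illustrative values only).
Yref = pd.DataFrame({'movieId': [93797, 25899, 5673, 86308]})
Y = pd.DataFrame({'movieId': [93797, 1222, 5673, 86308]})
X = pd.DataFrame({
    'test1': [0.038039, 0.054539, 0.002842, 0.136707],
    'test2': [0.212623, 0.257145, 0.209085, 0.377181],
})

# Boolean mask over the rows of Y: True where the movieId also exists in Yref.
keep = Y['movieId'].isin(Yref['movieId']).to_numpy()

# Apply the same positional mask to both frames so they stay row-aligned.
Y_kept = Y.loc[keep].reset_index(drop=True)
X_kept = X.loc[keep].reset_index(drop=True)
print(Y_kept)
print(X_kept)
```

Note that Y_kept keeps the row order of Y, whereas the result of Yref.merge(...) follows the order of Yref, so the two are not necessarily in the same order.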
Thanks for your comments.