pandas, subtract dataframe from another, when column match

Question

I have a dataframe (movielens dataset)

(Pdb) self.data_train.head()
       userId  movieId  rating   timestamp
65414     466      608     4.0   945139883
79720     547     6218     4.0  1089518106
63354     457     4007     3.5  1471383787
29923     213    59333     2.5  1462636955
63651     457   102194     2.5  1471383710

I computed the mean rating for each user:

user_mean = self.data_train['rating'].groupby(self.data_train['userId']).mean()
(Pdb) user_mean.head()
userId
1    2.527778
2    3.426471
3    3.588889
4    4.363158
5    3.908602

I want to subtract each user's mean from that user's ratings in the first dataframe.

Is there a way to do this without an explicit for loop?

score 0 · Accepted Answer · answered Mar 16 '19 at 15:21

I think you need GroupBy.transform with 'mean': it returns a Series of the same length as the original DataFrame, so you can subtract it from the column with Series.sub:

s = self.data_train.groupby('userId')['rating'].transform('mean')
self.data_train['new'] = self.data_train['rating'].sub(s)

Sample (userId values changed so that users repeat, which makes a better example):

print (data_train)
       userId  movieId  rating   timestamp
65414     466      608     4.0   945139883
79720     466     6218     4.0  1089518106
63354     457     4007     3.5  1471383787
29923     466    59333     2.5  1462636955
63651     457   102194     2.5  1471383710

s = data_train.groupby('userId')['rating'].transform('mean')
print (s)
65414    3.5
79720    3.5
63354    3.0
29923    3.5
63651    3.0
Name: rating, dtype: float64

data_train['new'] = data_train['rating'].sub(s)
print (data_train)
       userId  movieId  rating   timestamp  new
65414     466      608     4.0   945139883  0.5
79720     466     6218     4.0  1089518106  0.5
63354     457     4007     3.5  1471383787  0.5
29923     466    59333     2.5  1462636955 -1.0
63651     457   102194     2.5  1471383710 -0.5
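For anyone who wants to run the sample end to end, here is a minimal self-contained sketch. It rebuilds the five sample rows by hand (the DataFrame construction is an assumption, since the original data comes from the MovieLens files) and also shows an alternative that reuses the per-user `user_mean` Series from the question via `Series.map`. The names `per_row_mean`, `centered_map`, and `centered_transform` are made up for illustration.

```python
import pandas as pd

# Sample rows from the answer, rebuilt by hand for a runnable example.
data_train = pd.DataFrame(
    {
        "userId": [466, 466, 457, 466, 457],
        "movieId": [608, 6218, 4007, 59333, 102194],
        "rating": [4.0, 4.0, 3.5, 2.5, 2.5],
        "timestamp": [945139883, 1089518106, 1471383787, 1462636955, 1471383710],
    },
    index=[65414, 79720, 63354, 29923, 63651],
)

# Approach from the answer: transform returns one mean per original row.
s = data_train.groupby("userId")["rating"].transform("mean")
centered_transform = data_train["rating"].sub(s)

# Alternative: start from the one-value-per-user Series computed in the
# question and look up each row's userId in it with Series.map.
user_mean = data_train.groupby("userId")["rating"].mean()
per_row_mean = data_train["userId"].map(user_mean)
centered_map = data_train["rating"].sub(per_row_mean)

print(centered_transform)
print(centered_map)
```

Both approaches should give the same per-row result; `transform` simply skips the intermediate per-user Series.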

jezrael

Can you elaborate on each step of the `s = data_train.groupby('userId')['rating'].transform('mean')` line? I can't decipher the documentation, and I can't tell what the intermediate objects are by printing `data_train.groupby('userId')` or `data_train.groupby('userId')['rating']`. Thanks anyway – eugene Mar 16 '19 at 15:36

@eugene - The important part of the code is explained [here](http://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#dataframe-column-selection-in-groupby): `df['C'].groupby(df['A'])` (more verbose) is the same as `df.groupby('A')['C']` (the common form) – jezrael Mar 16 '19 at 15:39
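
A small sketch of that equivalence, using a made-up three-row frame (the column names `A` and `C` follow the comment; the data is invented for illustration):

```python
import pandas as pd

# Toy data: column A holds the group keys, column C the values.
df = pd.DataFrame({"A": ["x", "x", "y"], "C": [1.0, 3.0, 10.0]})

# Verbose form: group the Series C by the values of the Series A.
verbose = df["C"].groupby(df["A"]).mean()

# Common form: group the whole frame by A, then select column C.
common = df.groupby("A")["C"].mean()

print(verbose)
print(common)
```

Both spellings should print the same per-group means.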

@eugene - For the difference between `transform` and aggregation, see [this](https://stackoverflow.com/q/40957932/2901002) – jezrael Mar 16 '19 at 15:43
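
To make the difference concrete, here is a minimal sketch on invented data (the three rows and the variable names `aggregated` and `broadcast` are assumptions for illustration): aggregation returns one value per group, while `transform` returns one value per original row, aligned to the original index.

```python
import pandas as pd

# Toy ratings: user 466 has two rows, user 457 has one.
df = pd.DataFrame({"userId": [466, 466, 457], "rating": [4.0, 3.0, 2.0]})
grouped = df.groupby("userId")["rating"]

# Aggregation: one row per group, indexed by userId.
aggregated = grouped.mean()

# Transform: the group mean repeated for every row, keeping df's index.
broadcast = grouped.transform("mean")

print(len(aggregated), len(broadcast))
print(aggregated)
print(broadcast)
```

Because `broadcast` shares the index of `df`, it can be subtracted from `df['rating']` row by row, which is what the answer relies on.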

Wow, a much more approachable document! Thanks. I wish the pandas docs were executable like a Jupyter notebook – eugene Mar 16 '19 at 15:48
