Manipulating duplicate rows across a subset of columns in dataframe pandas

Question

Suppose I have a dataframe as follows:

df = pd.DataFrame({"user":[11,11,11,21,21,21,21,21,32,32], 
               "event":[0,0,1,0,0,1,1,1,0,0], 
               "datetime":['05:29:54','05:32:04','05:32:08',
                           '15:35:26','15:36:07','15:36:16','15:36:50','15:36:54',
                           '09:29:12', '09:29:25'] })

I would like to handle the repetitive lines across the first column (user) to reach the following.

In this case, we replace the 'event' column with the maximum value related in the 'user' column (for example for user=11, the maximum value for event is 1). And the third column is replaced by the average of the datetime.

P.S. It has been already discussed about dropping the repetitive rows here, however, I do not want to drop rows blindly. Especially when I am dealing with a dataframe with a lot of attributes.

rafaelc · Accepted Answer · 2018-11-06T19:40:10.240

3

You want to groupby and aggregate

df.groupby('user').agg({'event': 'max', 
                        'datetime': lambda s: pd.to_timedelta(s).mean()})

If you want, you can also just change your datetime column first to timedelta using pd.to_timedelta and just take the mean in the agg

You can use str to represent the way you intend

df.groupby('user').agg({'event': 'max', 
                        'datetime': lambda s: str(pd.to_timedelta(s).mean().to_pytimedelta())})

edited Nov 06 '18 at 19:40

answered Nov 06 '18 at 15:07

rafaelc

57,686
15
58
82

Thanks @RafaelC for the quick answer. I liked the way that you did average on the datetime column. However, if I have the datetime columns as e.g., '2018-09-01 05:29:54', then the average method return funny result for the day-month-year! although it is correct for the hour-min-sec as you have shown it. Do you have any comment on it? – Amin Kiany Nov 06 '18 at 19:04
@AminKiany You can just convert back to python time delta and use `str` on it. Will edit in a bit – rafaelc Nov 06 '18 at 19:39

score 1 · Answer 2 · answered Nov 06 '18 at 15:07

You can convert datetimes to native integers and aggregate mean, last convert back and for HH:MM:SS strings use strftime:

df['datetime'] = pd.to_datetime(df['datetime']).astype(np.int64)

df1 = df.groupby('user', as_index=False).agg({'event':'max', 'datetime':'mean'})
df1['datetime'] = pd.to_datetime(df1['datetime']).dt.strftime('%H:%M:%S')
print (df1)
   user  event  datetime
0    11      1  05:31:22
1    21      1  15:36:18
2    32      0  09:29:18

Thanks! for the quick reply :) – Amin Kiany Nov 06 '18 at 20:34 — Amin Kiany, Nov 06 '18 at 20:34

Manipulating duplicate rows across a subset of columns in dataframe pandas

2 Answers2