0

Suppose I have a dataframe as follows:

df = pd.DataFrame({"user":[11,11,11,21,21,21,21,21,32,32], 
               "event":[0,0,1,0,0,1,1,1,0,0], 
               "datetime":['05:29:54','05:32:04','05:32:08',
                           '15:35:26','15:36:07','15:36:16','15:36:50','15:36:54',
                           '09:29:12', '09:29:25'] })

enter image description here

I would like to handle the repetitive lines across the first column (user) to reach the following.

enter image description here

In this case, we replace the 'event' column with the maximum value related in the 'user' column (for example for user=11, the maximum value for event is 1). And the third column is replaced by the average of the datetime.

P.S. It has been already discussed about dropping the repetitive rows here, however, I do not want to drop rows blindly. Especially when I am dealing with a dataframe with a lot of attributes.

Amin Kiany
  • 722
  • 8
  • 17

2 Answers2

3

You want to groupby and aggregate

df.groupby('user').agg({'event': 'max', 
                        'datetime': lambda s: pd.to_timedelta(s).mean()})

If you want, you can also just change your datetime column first to timedelta using pd.to_timedelta and just take the mean in the agg

You can use str to represent the way you intend

df.groupby('user').agg({'event': 'max', 
                        'datetime': lambda s: str(pd.to_timedelta(s).mean().to_pytimedelta())})
rafaelc
  • 57,686
  • 15
  • 58
  • 82
  • Thanks @RafaelC for the quick answer. I liked the way that you did average on the datetime column. However, if I have the datetime columns as e.g., '2018-09-01 05:29:54', then the average method return funny result for the day-month-year! although it is correct for the hour-min-sec as you have shown it. Do you have any comment on it? – Amin Kiany Nov 06 '18 at 19:04
  • @AminKiany You can just convert back to python time delta and use `str` on it. Will edit in a bit – rafaelc Nov 06 '18 at 19:39
1

You can convert datetimes to native integers and aggregate mean, last convert back and for HH:MM:SS strings use strftime:

df['datetime'] = pd.to_datetime(df['datetime']).astype(np.int64)

df1 = df.groupby('user', as_index=False).agg({'event':'max', 'datetime':'mean'})
df1['datetime'] = pd.to_datetime(df1['datetime']).dt.strftime('%H:%M:%S')
print (df1)
   user  event  datetime
0    11      1  05:31:22
1    21      1  15:36:18
2    32      0  09:29:18
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252