
I tried to ask this question previously, but it was too ambiguous so here goes again. I am new to programming, so I am still learning how to ask questions in a useful way.

In summary, I have a pandas dataframe that resembles "INPUT DATA" that I would like to convert to "DESIRED OUTPUT", as shown below.

Each row contains an ID, a DateTime, and a Value. For each unique ID, the first row corresponds to timepoint 'zero', and each subsequent row contains a value recorded 5 minutes after the previous one.

I would like to calculate the mean across all IDs at every 'time elapsed' timepoint. For example, in "DESIRED OUTPUT", Time Elapsed=0.0 would have the value 128.3 ((100+105+180)/3); Time Elapsed=5.0 would have the value 150.0 ((150+110+190)/3); Time Elapsed=10.0 would have the value 133.3 ((125+90+185)/3), and so on for Time Elapsed=15, 20, 25, etc.

I'm not sure how to create a new column which has the value for the time elapsed for each ID (e.g. 0.0, 5.0, 10.0 etc). I think that once I know how to do that, then I can use the groupby function to calculate the means for each time elapsed.

INPUT DATA

ID  DateTime            Value
1   2018-01-01 15:00:00 100
1   2018-01-01 15:05:00 150
1   2018-01-01 15:10:00 125
2   2018-02-02 13:15:00 105
2   2018-02-02 13:20:00 110
2   2018-02-02 13:25:00 90
3   2019-03-03 05:05:00 180
3   2019-03-03 05:10:00 190
3   2019-03-03 05:15:00 185

DESIRED OUTPUT

Time Elapsed    Mean Value
0.0             128.3
5.0             150.0
10.0            133.3
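
For reference, a minimal sketch to reproduce the INPUT DATA above (column names are taken from the tables; parsing DateTime with pd.to_datetime is my assumption):

import pandas as pd

# Rebuild the INPUT DATA table; DateTime is parsed to datetime64 so that
# time arithmetic (e.g. subtracting the first timestamp per ID) works
df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'DateTime': pd.to_datetime([
        '2018-01-01 15:00:00', '2018-01-01 15:05:00', '2018-01-01 15:10:00',
        '2018-02-02 13:15:00', '2018-02-02 13:20:00', '2018-02-02 13:25:00',
        '2019-03-03 05:05:00', '2019-03-03 05:10:00', '2019-03-03 05:15:00',
    ]),
    'Value': [100, 150, 125, 105, 110, 90, 180, 190, 185],
})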

2 Answers


Here is one way: use groupby with transform to build the group key 'Time Elapsed', then group by that key and take the mean.

df['Time Elapsed']=df.DateTime-df.groupby('ID').DateTime.transform('first')
df.groupby('Time Elapsed').Value.mean()
Out[998]: 
Time Elapsed
00:00:00    128.333333
00:05:00    150.000000
00:10:00    133.333333
Name: Value, dtype: float64
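
If you would rather have 'Time Elapsed' as minutes (0.0, 5.0, 10.0), matching the desired output, instead of a Timedelta index, here is a possible follow-up (not part of the original answer; it assumes dividing the Timedelta column by pd.Timedelta(minutes=1)):

# Convert the per-ID offset to float minutes before grouping
df['Time Elapsed'] = (df.DateTime - df.groupby('ID').DateTime.transform('first')) / pd.Timedelta(minutes=1)
df.groupby('Time Elapsed').Value.mean()
# Time Elapsed
# 0.0     128.333333
# 5.0     150.000000
# 10.0    133.333333
# Name: Value, dtype: float64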
BENY

You can do this explicitly by taking advantage of the datetime attributes of the DateTime column in your DataFrame

First, get the year, month and day for each DateTime, since they all vary in your data

df['month'] = df['DateTime'].dt.month
df['day'] = df['DateTime'].dt.day
df['year'] = df['DateTime'].dt.year

print(df)
   ID            DateTime  Value  month  day  year
1   1 2018-01-01 15:00:00    100      1    1  2018
1   1 2018-01-01 15:05:00    150      1    1  2018
1   1 2018-01-01 15:10:00    125      1    1  2018
2   2 2018-02-02 13:15:00    105      2    2  2018
2   2 2018-02-02 13:20:00    110      2    2  2018
2   2 2018-02-02 13:25:00     90      2    2  2018
3   3 2019-03-03 05:05:00    180      3    3  2019
3   3 2019-03-03 05:10:00    190      3    3  2019
3   3 2019-03-03 05:15:00    185      3    3  2019

Then append a sequential DateTime counter column (per this SO post)

  • the counter is computed within (1) each year, (2) then each month and then (3) each day
  • since the data are in multiples of 5 minutes, use this to scale the counter values (i.e. the counter will be in multiples of 5 minutes, rather than a sequence of increasing integers)
df['Time Elapsed'] = df.groupby(['year', 'month', 'day']).cumcount() + 1
df['Time Elapsed'] *= 5

print(df)
   ID            DateTime  Value  month  day  year  Time Elapsed
1   1 2018-01-01 15:00:00    100      1    1  2018             5
1   1 2018-01-01 15:05:00    150      1    1  2018            10
1   1 2018-01-01 15:10:00    125      1    1  2018            15
2   2 2018-02-02 13:15:00    105      2    2  2018             5
2   2 2018-02-02 13:20:00    110      2    2  2018            10
2   2 2018-02-02 13:25:00     90      2    2  2018            15
3   3 2019-03-03 05:05:00    180      3    3  2019             5
3   3 2019-03-03 05:10:00    190      3    3  2019            10
3   3 2019-03-03 05:15:00    185      3    3  2019            15

Perform the groupby over the newly appended counter column

dfg = df.groupby('Time Elapsed')['Value'].mean()

print(dfg)
Time Elapsed
5     128.333333
10    150.000000
15    133.333333
Name: Value, dtype: float64
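
Note that this output starts at 5 rather than at 0.0, as in the desired output. If the elapsed time should start at zero, one small tweak (a sketch, not from the original answer) is to scale cumcount() directly instead of adding 1 first:

# cumcount() already starts at 0 within each group, so scaling it
# directly yields 0, 5, 10, ... rather than 5, 10, 15, ...
df['Time Elapsed'] = df.groupby(['year', 'month', 'day']).cumcount() * 5
dfg = df.groupby('Time Elapsed')['Value'].mean()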
edesz
  • Thanks very much - this is very helpful. One problem is that there might be different IDs which have entries on the same day, so the cumulative record groups these IDs together. i.e. instead of the desired output of "5, 10, 15" and "5, 10, 15" it will be "5, 10, 15, 20, 25, 30." Is there a way of using the groupby function to separate these out? – noobpython Apr 22 '19 at 06:29
  • IIUC, you should try `df['Time Elapsed'] = df.groupby(['ID', 'year', 'month', 'day']).cumcount() + 1` in order to account for variable `ID`s. The rest of the code could be left unchanged. – edesz Apr 22 '19 at 14:19
  • Also, I should add that the answer posted by @Wen-Ben [here](https://stackoverflow.com/a/55788649/4057186) could be used too since that solution accounts for difference in `ID`s. – edesz Apr 22 '19 at 15:45
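
For completeness, a quick sketch of the per-ID variant suggested in the comments (my spelling-out of edesz's suggestion; the rest of the answer's code is unchanged):

# Counting within each (ID, year, month, day) group means two different IDs
# that share a date each get their own 5, 10, 15, ... sequence
df['Time Elapsed'] = df.groupby(['ID', 'year', 'month', 'day']).cumcount() + 1
df['Time Elapsed'] *= 5
df.groupby('Time Elapsed')['Value'].mean()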