Aggregate Data Based on Rows Python

Question

I have a data set that looks like this:

      Date          | ID |  Task |   Description
2016-01-06 00:00:00 | 1  |  010  |   This is text
2016-01-06 00:10:00 | 1  |  020  |   This is text
2016-01-06 00:20:00 | 1  |  010  |   This is text
2016-01-06 01:00:00 | 1  |  020  |   This is text
2016-01-06 01:10:00 | 1  |  030  |   This is text
2016-02-06 00:00:00 | 2  |  010  |   This is text
2016-02-06 00:10:00 | 2  |  020  |   This is text
2016-02-06 00:20:00 | 2  |  010  |   This is text
2016-02-06 01:00:00 | 2  |  020  |   This is text
2016-02-06 01:01:00 | 2  |  030  |   This is text

Task 020usually occurs after task 010 which means that when Task 020 starts means that task 010 ends, same applies for Task 020, if it comes before any other Task it means that it has stopped.

I need to group by Task calculating the average duration, total sum and count of each type of task in each ID, so I am looking for something like this:

ID  | Task | Average | Sum | Count
1   |  010 |   25    | 50  |  2 
1   |  020 |   10    | 20  |  2
etc |  etc |   etc   | etc |  etc

There are more IDs but I only care about 010 and 020, so whatever number is returned from them is acceptable.

Can someone help me on how to do this in Python?

That is one problem I have, I can't really know that duration. However, since I only need `010` and `020`, I wouldn't need to know it. — Max Payne, Apr 03 '17 at 11:26

score 0 · Answer 1 · edited May 23 '17 at 11:54

0

I think it's a simple .groupby() that you need. You sample output doesn't show any complicated linking between timestamps and Task or ID

df['count'] = df.groupby(['ID','Task']).size()

will give you the count of each unique ID/Task in your data. To do a sum or average, it's similar, but you need a column with something to sum.

See here for more details.

edited May 23 '17 at 11:54

Community

1
1

answered Apr 03 '17 at 11:36

philshem

24,761
8
61
127

Thanks for your response, just one question though. I need to calculate the difference in time in the `Date` column, would this mean that I need to do this before hand somehow? – Max Payne Apr 03 '17 at 11:46
you then need to use `.shift()` to create a lead/lag function (diff of t1 and t2) http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html – philshem Apr 03 '17 at 11:52

jezrael · Answer 2 · 2017-04-03T12:18:09.063

It seems you need agg with groupby, but in sample not numeric column so col was added:

print (df)
                  Date  ID Task   Description      col
0  2016-01-06 00:00:00   1  010  This is text        1
1  2016-01-06 00:10:00   1  020  This is text        2
2  2016-01-06 00:20:00   1  010  This is text        6
3  2016-01-06 01:00:00   1  020  This is text        1
4  2016-01-06 01:10:00   1  030  This is text        3
5  2016-02-06 00:00:00   2  010  This is text        1
6  2016-02-06 00:10:00   2  020  This is text        8
7  2016-02-06 00:20:00   2  010  This is text        9
8  2016-02-06 01:00:00   2  020  This is text        1

df = df.groupby(['ID','Task'])['col'].agg(['sum','size', 'mean']).reset_index()
print (df)
   ID Task  sum  size  mean
0   1  010    7     2   3.5
1   1  020    3     2   1.5
2   1  030    3     1   3.0
3   2  010   10     2   5.0
4   2  020    9     2   4.5

If need aggreagte datetime, id is a bit complicated, because need timedeltas:

df.Date = pd.to_timedelta(df.Date).dt.total_seconds()
df = df.groupby(['ID','Task'])['Date']
       .agg(['sum','size', 'mean']).astype(np.int64).reset_index()
df['sum'] = pd.to_timedelta(df['sum'])
df['mean'] = pd.to_timedelta(df['mean'])
print (df)
   ID Task             sum  size            mean
0   1  010 00:00:02.904078     2 00:00:01.452039
1   1  020 00:00:02.904081     2 00:00:01.452040
2   1  030 00:00:01.452042     1 00:00:01.452042
3   2  010 00:00:02.909434     2 00:00:01.454717
4   2  020 00:00:02.909437     2 00:00:01.454718

For finding difference in column date:

print (df.Date.dtypes)
object

#if dtype of column is not datetime, first convert
df.Date = pd.to_datetime(df.Date )
print (df.Date.diff())
0                NaT
1    0 days 00:10:00
2    0 days 00:10:00
3    0 days 00:40:00
4    0 days 00:10:00
5   30 days 22:50:00
6    0 days 00:10:00
7    0 days 00:10:00
8    0 days 00:40:00
9    0 days 00:01:00
Name: Date, dtype: timedelta64[ns]

So would this mean that I need to calculate somehow the difference in the time in the `Date` column and then do the `groupby`? — Max Payne, Apr 03 '17 at 11:44
Maybe yes, the best is create sample data with desired output. — jezrael, Apr 03 '17 at 11:46
You can get difference by `diff` function, please check my answer. If need something else, please let me know. — jezrael, Apr 03 '17 at 12:18

Aggregate Data Based on Rows Python

2 Answers2