pandas: Calculated column based on values in one column

Question

I have columns like this in a csv file (I load it using read_csv('fileA.csv', parse_dates=['ProcessA_Timestamp']))

Item    ProcessA_Timestamp
'A'    2014-06-08 03:32:20
'B'    2014-06-08 03:32:20
'A'    2014-06-08 03:33:19
'C'    2014-06-08 03:33:20
'B'    2014-06-08 03:33:40
'D'    2014-06-08 03:38:20

How would I go about creating a column called ProcessA_ProcessingTime, which would be the time difference between last time an item occurs in the table - first time it occurs in the table.

Similarly, I have other data frames (which I'm not sure if they should be merged into one dataframe).. that have their own Process*_Timestamps.

Finally, I need to create a table, where the data is like this:

Item ProcessA_ProcessingTime ProcessB_ProcessingTime ... ProcessX_ProcessingTime
'A'                 00:00:59                  ...
'B'                 00:01:21
'C'         NOT FINISHED YET
'D'         NOT FINISHED YET

Make two columns, a `Process` column for storing the name of the column, and a `ProcessingTime` column for storing that. Then you can do `df.groupby(['Item', 'Process'])`. — U2EF1, Jun 09 '14 at 06:55

Ankush Shah · Accepted Answer · 2014-06-09T20:29:14.290

1

You can use the pandas groupby-apply combo. Group the dataframe by "Item" and apply a function that calculates the process time. Something like:

import pandas as pd

def calc_process_time(row):
    ts = row["ProcessA_Timestamp].values
    if len(ts) == 1:
        return pd.NaT
    else:
        return ts[-1] - ts[0] #last time - first time

df.groupby("Item").apply(calc_process_time)

edited Jun 09 '14 at 20:29

answered Jun 08 '14 at 08:38

Ankush Shah

938
8
13

I know you've answered my question precisely, but do you think, for the sake of datatype consistency, `"NOT FINISHED YET"` should be replaced by `NaN`? – Lelouch Lamperouge Jun 08 '14 at 19:36
No, it should be replaced by `pd.NaT` (not a time), or simply be excluded from the data. Note that `df.groupby('Item').diff()` does this already. – U2EF1 Jun 09 '14 at 06:53

pandas: Calculated column based on values in one column

1 Answers1

Linked