5

I'm looking for help with this simultaneous group-by / row-on-row difference problem in Pandas. The problem is exactly as stated here for R: How to calculate time difference between datetimes, for each group (student-contract)?

I have data like this:

#   USER_ID CONTRACT_REF SUBMISSION_DATE  
1        1        A        20/6 01:00   
2        1        A        20/6 02:00   
3        1        B        20/6 03:00   
4        4        A        20/6 04:00   
5        5        A        20/6 05:00   
6        5        B        20/6 06:00   
7        7        A        20/6 07:00   
8        7        B        20/6 08:00   
9        7        B        20/6 09:30   
10       7        B        20/6 10:00   

I want to calculate the time difference from the previous submission for each unique USER_ID - CONTRACT_REF pair.

Note: each USER_ID - CONTRACT_REF pair has to have a zero (or null) for its first appearance.

So the output should look as follows:

#   USER_ID CONTRACT_REF SUBMISSION_DATE   TIME_DIFFERENCE
1        1        A        20/6 01:00             0
2        1        A        20/6 02:00             1
3        1        B        20/6 03:00             0
4        4        A        20/6 04:00             0
5        5        A        20/6 05:00             0          
6        5        B        20/6 06:00             0
7        7        A        20/6 07:00             0
8        7        A        20/6 08:00             1
9        7        A        20/6 09:30             1.5
10       7        B        20/6 10:00             0

I'm currently moving to Pandas from R, and while I find the syntax refreshing, I'm a bit stumped when it comes to complex functions on dataframes.

Thanks in advance for any tips!

Community
  • 1
  • 1
Harry Palmer
  • 468
  • 1
  • 6
  • 17

1 Answers1

10

[Note: your data doesn't seem to match your desired output; there are no CONTRACT_REF Cs in the second, and even in your output, I don't see why the 5, B row is 1 and not 0. I'm assuming that these are mistakes on your part. Since you didn't comment, I'm going to use the data from the output, because it leads to a more interesting column.]

I might do something like

df["SUBMISSION_DATE"] = pd.to_datetime(df["SUBMISSION_DATE"],dayfirst=True)

gs = df.groupby(["USER_ID", "CONTRACT_REF"])["SUBMISSION_DATE"]
df["TIME_DIFF"] = gs.diff().fillna(0) / pd.datetools.timedelta(hours=1)

which produces

>>> df
    #  USER_ID CONTRACT_REF     SUBMISSION_DATE  TIME_DIFF
0   1        1            A 2014-06-20 01:00:00        0.0
1   2        1            A 2014-06-20 02:00:00        1.0
2   3        1            B 2014-06-20 03:00:00        0.0
3   4        4            A 2014-06-20 04:00:00        0.0
4   5        5            A 2014-06-20 05:00:00        0.0
5   6        5            B 2014-06-20 06:00:00        0.0
6   7        7            A 2014-06-20 07:00:00        0.0
7   8        7            A 2014-06-20 08:00:00        1.0
8   9        7            A 2014-06-20 09:30:00        1.5
9  10        7            B 2014-06-20 10:00:00        0.0

[10 rows x 5 columns]

Some explanation: starting from a dataframe like

>>> df
    #  USER_ID CONTRACT_REF SUBMISSION_DATE
0   1        1            A      20/6 01:00
1   2        1            A      20/6 02:00
2   3        1            B      20/6 03:00
3   4        4            A      20/6 04:00
4   5        5            A      20/6 05:00
5   6        5            B      20/6 06:00
6   7        7            A      20/6 07:00
7   8        7            A      20/6 08:00
8   9        7            A      20/6 09:30
9  10        7            B      20/6 10:00

[10 rows x 4 columns]

We want to turn the SUBMISSION_DATE column from strings to real date objects:

>>> df["SUBMISSION_DATE"] = pd.to_datetime(df["SUBMISSION_DATE"],dayfirst=True)
>>> df
    #  USER_ID CONTRACT_REF     SUBMISSION_DATE
0   1        1            A 2014-06-20 01:00:00
1   2        1            A 2014-06-20 02:00:00
2   3        1            B 2014-06-20 03:00:00
3   4        4            A 2014-06-20 04:00:00
4   5        5            A 2014-06-20 05:00:00
5   6        5            B 2014-06-20 06:00:00
6   7        7            A 2014-06-20 07:00:00
7   8        7            A 2014-06-20 08:00:00
8   9        7            A 2014-06-20 09:30:00
9  10        7            B 2014-06-20 10:00:00

[10 rows x 4 columns]

Then we can group by USER_ID and CONTRACT_REF, and select the SUBMISSION_DATE column:

>>> gs = df.groupby(["USER_ID", "CONTRACT_REF"])["SUBMISSION_DATE"]
>>> gs
<pandas.core.groupby.SeriesGroupBy object at 0xa7af08c>

Then we can take the difference of each group:

>>> gs.diff()
0        NaT
1   01:00:00
2        NaT
3        NaT
4        NaT
5        NaT
6        NaT
7   01:00:00
8   01:30:00
9        NaT
dtype: timedelta64[ns]

NaT, Not-a-Time, is the temporal equivalent of NaN. We can fill these with 0:

>>> gs.diff().fillna(0)
0   00:00:00
1   01:00:00
2   00:00:00
3   00:00:00
4   00:00:00
5   00:00:00
6   00:00:00
7   01:00:00
8   01:30:00
9   00:00:00
dtype: timedelta64[ns]

And since you want things to be measured in hours, we can divide by a timedelta of 1 hour:

>>> gs.diff().fillna(0) / pd.datetools.timedelta(hours=1)
0    0.0
1    1.0
2    0.0
3    0.0
4    0.0
5    0.0
6    0.0
7    1.0
8    1.5
9    0.0
dtype: float64

Assign this to the frame:

>>> df["TIME_DIFF"] = gs.diff().fillna(0) / pd.datetools.timedelta(hours=1)

And we're done:

>>> df
    #  USER_ID CONTRACT_REF     SUBMISSION_DATE  TIME_DIFF
0   1        1            A 2014-06-20 01:00:00        0.0
1   2        1            A 2014-06-20 02:00:00        1.0
2   3        1            B 2014-06-20 03:00:00        0.0
3   4        4            A 2014-06-20 04:00:00        0.0
4   5        5            A 2014-06-20 05:00:00        0.0
5   6        5            B 2014-06-20 06:00:00        0.0
6   7        7            A 2014-06-20 07:00:00        0.0
7   8        7            A 2014-06-20 08:00:00        1.0
8   9        7            A 2014-06-20 09:30:00        1.5
9  10        7            B 2014-06-20 10:00:00        0.0

[10 rows x 5 columns]
DSM
  • 342,061
  • 65
  • 592
  • 494
  • FYI, you can also divide by ``np.timedelta64(1,'H')``, or do ``astype('timedelta64[H]')`` instead of the division, in 0.13, to provide a conversion on the timedelta64 to whatever float output format you want) – Jeff Feb 01 '14 at 18:23
  • @Jeff: `np.timedelta64(1, 'H')` would work, but `astype('timedelta64[h]')` -- I needed a lowercase `h` -- seems to truncate, so I wouldn't get 1.5 out. – DSM Feb 01 '14 at 18:34
  • Wow, didn't know about these groupby objects, looks very very useful! Thanks DSM - and very helpful explanation. And you were right about the mistakes in the examples. – Harry Palmer Feb 01 '14 at 18:34
  • @HarryPalmer: Definitely worth reading through the [`groupby`](http://pandas.pydata.org/pandas-docs/stable/groupby.html) section of the docs-- used wisely, or even not-so-wisely ;^) -- `groupby` can make your life much easier. – DSM Feb 01 '14 at 18:36
  • yep the astyping truncates but dividing gives u the exact value - feature! http://pandas.pydata.org/pandas-docs/dev/timeseries.html#time-deltas-conversions – Jeff Feb 01 '14 at 20:02
  • To avoid warnings as .fillna(0) won't actually worlk, `df_rolling['time_diff'] = gs \ .diff().fillna(pd.Timedelta(seconds=0)) / pd.Timedelta(hours=1)` – CheTesta Apr 18 '20 at 13:17