1

OK, I'm super confused here.

I had a look at this post (Time difference within group by objects in Python Pandas) and others and tried to follow them. Still I'm getting wonky answers.

What I would like to do is calculate the time in minutes since the previous row within each person's ID, and the first reading for each person should also = 30 minutes.

Here's what I tried:

import pandas as pd

ids = [1,1,1,2,2,2,3,3,3,4]
dates = ["12/25/2021 7:47:01", "6/23/2021 20:02:32", "2/24/2021 18:57:40", "2/26/2021 20:17", "3/7/2021 11:23:01", "3/11/2021 17:48:55", "3/12/2021 19:22:32", "3/25/2021 11:11:11", "4/4/2021 14:53:40",  "4/6/2021 13:07:07"]

df = pd.DataFrame({'ids':ids,
                   'dates':dates})

df['dates'] = pd.to_datetime(df['dates'])

df = df.sort_values(by = ['ids', 'dates'])

print(df)

   ids               dates
2    1 2021-02-24 18:57:40
1    1 2021-06-23 20:02:32
0    1 2021-12-25 07:47:01
3    2 2021-02-26 20:17:00
4    2 2021-03-07 11:23:01
5    2 2021-03-11 17:48:55
6    3 2021-03-12 19:22:32
7    3 2021-03-25 11:11:11
8    3 2021-04-04 14:53:40
9    4 2021-04-06 13:07:07

df = df.assign(
    timediff=df.groupby(['ids']).dates.diff().dt.seconds.div(60).fillna(30))

print(df)

   ids               dates    timediff
2    1 2021-02-24 18:57:40   30.000000
1    1 2021-06-23 20:02:32   64.866667
0    1 2021-12-25 07:47:01  704.483333
3    2 2021-02-26 20:17:00   30.000000
4    2 2021-03-07 11:23:01  906.016667
5    2 2021-03-11 17:48:55  385.900000
6    3 2021-03-12 19:22:32   30.000000
7    3 2021-03-25 11:11:11  948.650000
8    3 2021-04-04 14:53:40  222.483333
9    4 2021-04-06 13:07:07   30.000000

As you can see, I'm getting weird results. Here is something like what I would expect (just ballparked numbers):

   ids               dates    timediff
2    1 2021-02-24 18:57:40   30.000000
1    1 2021-06-23 20:02:32   171360.000000
0    1 2021-12-25 07:47:01  262800.000000
3    2 2021-02-26 20:17:00   30.000000
4    2 2021-03-07 11:23:01  43800.0000
5    2 2021-03-11 17:48:55  34320.00000
6    3 2021-03-12 19:22:32   30.000000
7    3 2021-03-25 11:11:11  18720.00000
8    3 2021-04-04 14:53:40  14400.00000
9    4 2021-04-06 13:07:07   30.000000

What's going on here and how could I fix it? Thanks in advance.

Dr Wampa
  • 443
  • 3
  • 8
  • 2
    The key to fixing your unexpected output is to replace `.dt.seconds` with `.dt.total_seconds()`: https://stackoverflow.com/questions/5522031/convert-timedelta-to-total-seconds – Peter Leimbigler Feb 13 '22 at 01:10

1 Answers1

1

You need to use total_seconds() instead of seconds. The latter only includes the time within a day, but your time differences are larger than one day.

Ref: https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.total_seconds.html vs https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.seconds.html

John Zwinck
  • 239,568
  • 38
  • 324
  • 436