I have a DataFrame with TIME_IN and TIME_OUT columns (datetimes with one-second resolution). I want a new DataFrame with the total duration (TIME_OUT - TIME_IN) per date. Each day runs from 5 AM to 5 AM, so I adjust for that as well.

This is part of a mini-project to teach myself pandas, but my next application will be much more involved, so efficiency is key for me.

I've tried two approaches (resample and groupby), but both have the same issue: the timedelta DURATION column is not being summed.

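import datetime as dt
import pandas as pd

hrEnd = 5  # each "day" runs from 5 AM to 5 AM

# Shift TIME_IN back by hrEnd hours so early-morning rows count toward the previous date
df["DATE"] = pd.to_datetime((df["TIME_IN"] - dt.timedelta(hours=hrEnd)).dt.date)
df["DURATION"] = df["TIME_OUT"] - df["TIME_IN"]

# Attempt 1: groupby
dfGroupBy = df.groupby("DATE").sum()

# Attempt 2: resample on a DATE index
df.set_index("DATE", inplace=True)
dfResample = df.resample("D").sum()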

It seems Pandas does not sum timedelta64 type columns the way I attempted, so the returned DataFrame simply does not include the DURATION column. What is the most efficient way to do this?

EDIT: Here is a sample of the raw data in df: [screenshot, not preserved]

Steven H
  • Please provide sample data. https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples?noredirect=1 – Michael Gardner Oct 11 '19 at 23:11
  • @MichaelGardner Included a screenshot of the data within df above. This is what df looks like just before any of the code included above. Thanks! – Steven H Oct 11 '19 at 23:23
  • pictures are not useful for us to create the DF & test it out. You have to paste the data as text or provide the csv file via some link. – moys Oct 12 '19 at 02:46

2 Answers

You can use the agg method of the grouped object to sum the duration column, as shown below:

import pandas as pd
import numpy as np

np.random.seed(10)

## Generate dummy data for testing
# Lowercase "h" is the hourly alias; uppercase "H" is deprecated in recent pandas
dt_range = pd.date_range("oct-12-2019", "oct-14-2019", freq="h")

arr = []
while len(arr)<10:
    i,j = np.random.choice(len(dt_range), 2)
    g = np.random.choice(4)
    if j>i:
        arr.append([g, dt_range[i], dt_range[j]])

df = pd.DataFrame(arr, columns=["group", "time_in", "time_out"])


## Solution
df["duration"] = df["time_out"] - df["time_in"]
# Passing the string "sum" avoids the warning recent pandas raises for np.sum in agg
df.groupby(df["time_in"].dt.date).agg({"duration": "sum"})
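
The snippet above groups on the calendar date, so it does not apply the 5 AM day boundary from the question. A minimal sketch of one way to fold that in, assuming the question's column names and a few made-up rows (the sample values and the names `day` and `daily` are illustrative, not from the original post):

```python
import pandas as pd

# Hypothetical sample rows; column names follow the question.
df = pd.DataFrame({
    "TIME_IN": pd.to_datetime([
        "2019-05-05 04:30:00",  # before 5 AM, so it belongs to the May 4 "day"
        "2019-05-05 07:39:02",
        "2019-05-05 14:08:26",
    ]),
    "TIME_OUT": pd.to_datetime([
        "2019-05-05 04:45:00",
        "2019-05-05 07:46:34",
        "2019-05-05 14:14:30",
    ]),
})

df["DURATION"] = df["TIME_OUT"] - df["TIME_IN"]

# Shift back 5 hours, then truncate to midnight to get the 5 AM to 5 AM day.
day = (df["TIME_IN"] - pd.Timedelta(hours=5)).dt.normalize().rename("DATE")

# Selecting the DURATION column explicitly keeps the timedelta dtype in the result.
daily = df.groupby(day)["DURATION"].sum()
print(daily)
```

Using `dt.normalize()` rather than `.dt.date` keeps the group key as datetime64, which should be cheaper to group on than Python date objects for large frames.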

Dev Khadka

I think your code works as expected?

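import datetime
import pandas as pd

df['TIME_IN'] = pd.to_datetime(df['TIME_IN'])
df['TIME_OUT'] = pd.to_datetime(df['TIME_OUT'])
# Shift by 5 hours so each "day" runs from 5 AM to 5 AM
df['DATE'] = (df['TIME_IN'] - datetime.timedelta(hours=5)).dt.date
df["DURATION"] = df["TIME_OUT"] - df["TIME_IN"]
# Select DURATION explicitly before summing
df.groupby("DATE")['DURATION'].sum()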

Input into groupby

    TIME_IN             TIME_OUT            DATE        DURATION
0   2019-05-06 11:46:51 2019-05-06 11:50:36 2019-05-06  00:03:45
1   2019-05-02 20:47:54 2019-05-02 20:52:22 2019-05-02  00:04:28
2   2019-05-05 07:39:02 2019-05-05 07:46:34 2019-05-05  00:07:32
3   2019-05-04 17:28:52 2019-05-04 17:32:57 2019-05-04  00:04:05
4   2019-05-05 14:08:26 2019-05-05 14:14:30 2019-05-05  00:06:04

Output after groupby

DATE
2019-05-02   00:04:28
2019-05-04   00:04:05
2019-05-05   00:13:36
2019-05-06   00:03:45

Seems to work as expected.
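
If a frame-wide `.sum()` still leaves out the timedelta column in a given pandas version, one possible workaround is to sum plain float seconds and convert back afterwards. This is only a sketch: the three rows are copied from the table above, and the `SECONDS`, `seconds_by_day` and `duration_by_day` names are made up for illustration.

```python
import pandas as pd

# Rows 0, 2 and 4 from the table above.
df = pd.DataFrame({
    "TIME_IN": pd.to_datetime([
        "2019-05-06 11:46:51",
        "2019-05-05 07:39:02",
        "2019-05-05 14:08:26",
    ]),
    "TIME_OUT": pd.to_datetime([
        "2019-05-06 11:50:36",
        "2019-05-05 07:46:34",
        "2019-05-05 14:14:30",
    ]),
})

df["DATE"] = (df["TIME_IN"] - pd.Timedelta(hours=5)).dt.date

# Store the duration as float seconds, which any numeric sum will keep.
df["SECONDS"] = (df["TIME_OUT"] - df["TIME_IN"]).dt.total_seconds()
seconds_by_day = df.groupby("DATE")["SECONDS"].sum()

# Convert the per-day totals back to timedeltas for display.
duration_by_day = pd.to_timedelta(seconds_by_day, unit="s")
print(duration_by_day)
```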

Ian