Pandas MultiIndex with an unrecognised time format - how to convert time and apply calculation

Question

EDIT: Thanks to Scott Boston for advising me on to correctly post.

I have a dataframe containing clock in/out date and times from work for all employees. Sample df input is below, but the real data set has a year of data for many employees.

Question: What I would like to do is to calculate the time spent in work for each employee over the year.

df = pd.DataFrame({'name': ['Joe Bloggs', 'Joe Bloggs', 'Joe Bloggs',
...                                   'Joe Bloggs', 'Jane Doe', 'Jane Doe', 'Jane Doe',
...                                   'Jane Doe'],
...                    'Date': ['2020-06-19','2020-06-19' , '2020-06-18', '2020-06-18', '2020-06-19',
...                            '2020-06-19', '2020-06-18', '2020-06-18'],
...                    'Time': ["17:30:06", "09:00:00", "17:44:00", "08:34:02", "16:30:06",
...                                "10:00:02", "15:45:33", "09:30:33"],
...                   'type': ["Logout", "Login", "Logout",
...                          "Login", "Logout", "Login",
...                          "Logout", "Login"]})```

Hi Scott, I have embedded a link, I think this is what you were seeking? I created another sample dataframe to preserve data protection, so what I have linked doesn't have a MultiIndex.thanks — Prolle, Jun 22 '20 at 20:11
https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples?answertab=oldest#tab-top — Scott Boston, Jun 22 '20 at 20:34

warped · Answer 1 · 2020-06-24T11:15:17.223

df['Time'] = pd.to_timedelta(df['Time'])
df['Date'] = pd.to_datetime(df['Date'])
df['time_complete'] = df['Time'] + df['Date']
df.groupby(['name', 'Date']).apply(lambda x: (x.sort_values('type', ascending=True)['time_complete'].diff().dropna()))

how it works:

Convert the dates to datetime, to allow grouping.

Convert the times to timedelta, to allow subtraction.

Create a complete time, to incorporate potential nighshifts (as spotted by @ScottBoston)

Then, group by date and employee to isolate those.

So, each group now corresponds to one employee at a specific date.

The individual groups have three columns, 'type' and 'Time', 'time_complete'.

Sorting the columns by 'type' will cause logout to come before login.

Then, we take the difference (column-(n) - column-(n+1)) of column 'time_complete' within each sorted group, which gives the time spent between login and logout.

Finally, we remove null values that arise through None - column-(n).

Please add some commentary and explain in brief. Simply answering with code won't help others understand. — Vik, Jun 24 '20 at 00:17

score 2 · Accepted Answer · answered Jun 24 '20 at 01:54

You can do it this way:

#Create a datetime column combining both date and time also create year column
df['datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%Y-%m-%d %H:%M:%S')
df['year'] = df['datetime'].dt.year

#Sort the dataframe by datetime 
df = df.sort_values('datetime')

#Create "sessions" worked by Login records
session = (df['type'] == 'Login').groupby(df['name']).cumsum().rename('Session_No')

#Reshape the dataframe to get login and logouts for a session on one row
#The use diff to calculate worked during that session
df_time = df.set_index(['name', 'year', session, 'type'])['datetime']\
            .unstack().diff(axis=1).dropna(axis=1, how='all')\
            .rename(columns={'Logout':'TimeLoggedIn'})

#Sum on Name and Year
df_time.sum(level=[0,1])

Output:

         name  year TimeLoggedIn
0    Jane Doe  2020     12:45:04
1  Joe Bloggs  2020     17:40:04

Note: @warped solution works and works well, however, if you had an employee who worked overnight, I think that code breaks down. This answer should capture where an employee works past midnight.

Pandas MultiIndex with an unrecognised time format - how to convert time and apply calculation

2 Answers2

Linked