2

I have a dataframe in which a user's daily entry and exit is noted, but the user comes at different time each day, for example below is the input user data

Date    UserID  Intime  Outtime
2018-06-29  73456   2018-06-29 07:30:54  2018-06-29 15:30:13
2018-06-28  73456   2018-06-28 08:29:23  2018-06-28 17:28:31
2018-06-27  73456   2018-06-27 11:26:02  2018-06-27 19:30:09
2018-06-26  73456   2018-06-26 14:20:42  2018-06-26 23:25:38
2018-06-25  73456   2018-06-25 07:31:19  2018-06-25 16:24:26

I need to maintain a hourly record of this user. so in a separate sheet i have the date and time on hourly basis. i need to add 1 in the user field, in which hour the user is in. for example output data.

Hours                User
2018-06-29 0:00:00    
2018-06-29 1:00:00
2018-06-29 2:00:00
2018-06-29 3:00:00
2018-06-29 4:00:00
2018-06-29 5:00:00
2018-06-29 6:00:00
2018-06-29 7:00:00    1
2018-06-29 8:00:00    1
2018-06-29 9:00:00    1
2018-06-29 10:00:00    1
2018-06-29 11:00:00    1
2018-06-29 12:00:00    1
2018-06-29 13:00:00    1
2018-06-29 14:00:00    1
2018-06-29 15:00:00    1
2018-06-29 16:00:00
2018-06-29 17:00:00

I am able to create hour column but not able to mark attendance for hour in which user is in.

Any help will be highly appreciated. Thank you !

Chandella07
  • 2,089
  • 14
  • 22

2 Answers2

2

Try this:

Build the hourly set

s = pd.date_range(df1.index[0], df1.index[-1]+pd.DateOffset(1), freq='H')
idx = pd.period_range(df1.index[0], df1.index[-1]+pd.DateOffset(1), freq='H')
idx = idx[:-1]

Find when the index is inside the range of Intime and Outtime

sol = [int((s[i] >= df1.iloc[j,1] - pd.DateOffset(hours=1)) & (s[i] <= df1.iloc[j,2])) for j in range(len(df1)) for i in range(len(idx))]

Use numpy to reshape the list into a friendlier format

sol2 = np.array(sol)
sol3 = np.reshape(sol2, (s.shape[0]-1,len(df1)),order = 'F')

Build the desired series

ans = pd.Series(np.amax(sol3, axis=1),idx.values)

display the results

print(ans)

output (for last day):

2018-06-29 00:00    0
2018-06-29 01:00    0
2018-06-29 02:00    0
2018-06-29 03:00    0
2018-06-29 04:00    0
2018-06-29 05:00    0
2018-06-29 06:00    0
2018-06-29 07:00    1
2018-06-29 08:00    1
2018-06-29 09:00    1
2018-06-29 10:00    1
2018-06-29 11:00    1
2018-06-29 12:00    1
2018-06-29 13:00    1
2018-06-29 14:00    1
2018-06-29 15:00    1
2018-06-29 16:00    0
2018-06-29 17:00    0
2018-06-29 18:00    0
2018-06-29 19:00    0
2018-06-29 20:00    0
2018-06-29 21:00    0
Yuca
  • 6,010
  • 3
  • 22
  • 42
  • Thanks for answering, df1 you are refering to which dataframe ? as per your code i am taking df1 as input dataframe as mentioned in the question. I am getting `TypeError: unsupported operand type(s) for +: 'int' and 'datetime.timedelta'` on line `s = pd.date_range(df1.index[0], df1.index[-1]+pd.DateOffset(1), freq='H')` after this. using python 2.7 – Chandella07 Nov 03 '18 at 17:28
  • hmm, this is for python 3, but for a workaround call pd.to_datetime on the index :) – Yuca Nov 03 '18 at 21:58
  • any feedback from OP? – Yuca Nov 07 '18 at 12:32
  • i was getting some index errors while creating the `sol = [int((s[i] >= df1.iloc[j,1] - pd.DateOffset(hours=1)) & (s[i] <= df1.iloc[j,2])) for j in range(len(df1)) for i in range(len(idx))]` anyways i have developed a solution in python 2.7, see my answer. Thanks again for your help. – Chandella07 Nov 08 '18 at 08:21
  • the index errors were due datetime formats. To way to fix it is to do `df1.index = pd.to_datetime(df1.index)`. Does doing that produce the desired results? – Yuca Nov 08 '18 at 15:17
  • I have used `df1.index = pd.to_datetime(df1.index)` it is working fine, able to create df in datetime format but i am getting index error after this on `sol = [int((s[i] >= df1.iloc[j,1] - pd.DateOffset(hours=1)) & (s[i] <= df1.iloc[j,2])) for j in range(len(df1)) for i in range(len(idx))]`, anyways i am able to solve this problem, upvoting your answer for your efforts, sorry i can't debug this more. – Chandella07 Nov 08 '18 at 16:38
2

below is the solution that i have developed. using python 2.7

import pandas as pd

data = pd.read_excel("June_2018.xlsx", "sheetname")
df = pd.DataFrame(data)  #created dataframe from excel data

my_df = pd.DataFrame({'Hours': pd.date_range("2018-06-01", "2018-06-30", freq='1H', closed='left')})  #created new dataframe with hourly frequency

df['Intime'] = df['Intime'].apply(lambda dt: dt.replace(minute=0, second=0))  #round off min and seconds to 0
df['Outtime'] = df['Outtime'].apply(lambda dt: dt.replace(minute=0, second=0))  #round off min and seconds to 0

for idx in set(df["UserID"]):  #i have multiple userid so iterating over users
    emp_data =  df[df["UserID"] == idx]  #filtered the table on user id basis
    for item in set(emp_data["Date"].dt.date):  #iterating over date
        in_time = emp_data["Intime"][emp_data["Date"] == item]  #finding intime
        out_time = emp_data["Outtime"][emp_data["Date"] == item]  #finding outtime

        my_df.loc[(my_df["Hours"].values >= in_time.values) & (my_df["Hours"].values <= out_time.values), idx] = 1  #adding 1 where hourly value is in range

print my_df  #printing the df
Chandella07
  • 2,089
  • 14
  • 22