Suppose the below dataframe df:
import pandas as pd
data = {"Time":["2021-01-10 21:00:00", "2021-01-10 22:00:00",
"2021-01-10 21:30:01", "2021-01-10 21:45:00",
"2021-01-12 09:00:00", "2021-01-12 09:30:00"],
"ID":["1","1","2","2","2","2"],
"Event":["cut","cut", "smooth","smooth","cut","cut"],
"Status":["start", "complete", "start", "complete","start", "complete",]}
df = pd.DataFrame(data)
df["Time"] = pd.to_datetime(df["Time"])
df["ID"] = df["ID"].astype("int")
df
My final goal is to calculate the total production time per unique 'ID', without taking into account any potential time breaks between each time interval. The start time for each ID is the 1st instance of "start" Status, and the end production time is the last instance of "complete" Status per ID. E.g., for ID==1 this is 1h (3600s), while for ID==2 is about 45min (15min in the 1st, and 30min in the 2nd time interval).
Since I would also be interested in capturing the time intervals per unique ID (e.g., ID==1 has only 1 interval which coincides with its total production time, ID==2 has 2 pairs of start-complete statuses, and hence 2 intervals), what I thought to do is create two dictionaries: 'time_diff_dict', and 'cumulativeSumId':
- 'time_diff_dict': key:unique ID, values: the time intervals
- 'cumulativeSumId': key: unique ID, values: the cumulative sum of the time intervals above
In this way, in the 'cumulativeSumId' dictionary, the last key value per each key (ID) would be equal to its total production time.
However, imagine that the real df has about 180,000 rows with about 3000 unique IDs, and it takes about 10min to terminate the below code. Probably I will have to use iterations methods like the ones described here or more efficient nested loops, however, I need some help implementing a better performance for this particular case. My current code is:
# dictionary containing the time intervals of IDs (delta of complete-start)
# key: ID, value: time interval for every unique complete-start completion per unique ID
time_diff_dict = {key: [] for key in list(df.ID.unique())}
# same structure as time_diff_dict, but here storing the time_diff_dict values per ID
cumulativeSumId = {key: [] for key in list(df.ID.unique())}
# initialise time difference to 0 before calculating time interval
time_diff = 0
for unique_ID in df.ID.unique():
for row in df.itertuples(index=True, name="Pandas"):
if row.ID == unique_ID:
if row.Status == "start":
start = row.Time
cumulativeSumId[unique_ID].append(sum(time_diff_dict[unique_ID]))
elif row.Status == "complete":
end = row.Time
delta = end - start
time_diff = delta.total_seconds()
time_diff_dict[unique_ID].append(time_diff)
cumulativeSumId[unique_ID].append(sum(time_diff_dict[unique_ID]))
dict_values = list(cumulativeSumId.values())
df["Cumulative_Time"] = [item for sublist in dict_values for item in sublist]
The result of df now:
, where e.g. for ID==1, the total production time is 3600sec, and for ID==2 is 2699sec because this is the last instance in its cumulative sum time dictionary.
After that, I create a new df with: unique IDs, "totalTimeId", and "timeIntervals":
'''
* create list of lists
* every sublist is a dataframe per unique ID
'''
lists_of_IDdfs =[]
for id, df_id in df.groupby("ID"):
lists_of_IDdfs.append(df_id)
data = []
for df in range(len(lists_of_IDdfs)):
data.append((lists_of_IDdfs[df].ID.iloc[-1], lists_of_IDdfs[df].Cumulative_Time.iloc[-1]))
df_ID_TotalTime = pd.DataFrame(data, columns= ["ID", "totalTimeId"])
'''add the respective time interval data points per unique ID'''
df_ID_TotalTime["timeIntervals"] = df_ID_TotalTime["ID"].map(time_diff_dict)
df_ID_TotalTime
Final desired result:
I would appreciate any thoughts and help! Thank you!