
Suppose the DataFrame `df` below:

import pandas as pd
data = {"Time":["2021-01-10 21:00:00", "2021-01-10 22:00:00", 
                "2021-01-10 21:30:01", "2021-01-10 21:45:00",
                "2021-01-12 09:00:00", "2021-01-12 09:30:00"],
        "ID":["1","1","2","2","2","2"],
        "Event":["cut","cut", "smooth","smooth","cut","cut"],
        "Status":["start", "complete", "start", "complete","start", "complete",]}
df = pd.DataFrame(data)  
df["Time"] = pd.to_datetime(df["Time"])  
df["ID"] = df["ID"].astype("int")  
df

                 Time  ID   Event    Status
0 2021-01-10 21:00:00   1     cut     start
1 2021-01-10 22:00:00   1     cut  complete
2 2021-01-10 21:30:01   2  smooth     start
3 2021-01-10 21:45:00   2  smooth  complete
4 2021-01-12 09:00:00   2     cut     start
5 2021-01-12 09:30:00   2     cut  complete

My final goal is to calculate the total production time per unique 'ID', excluding any potential breaks between the time intervals. The production start for each ID is the first "start" Status, and the production end is the last "complete" Status per ID. E.g., for ID==1 this is 1 h (3600 s), while for ID==2 it is about 45 min (about 15 min in the first time interval plus 30 min in the second).

Since I am also interested in capturing the time intervals per unique ID (e.g., ID==1 has only one interval, which coincides with its total production time; ID==2 has two start-complete pairs and hence two intervals), my idea was to create two dictionaries, 'time_diff_dict' and 'cumulativeSumId':

  • 'time_diff_dict': key:unique ID, values: the time intervals
  • 'cumulativeSumId': key: unique ID, values: the cumulative sum of the time intervals above

In this way, in the 'cumulativeSumId' dictionary, the last key value per each key (ID) would be equal to its total production time.
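For the sample frame, the intended contents of the two dictionaries can be written out by hand (an illustration derived from the timestamps; note that each "start" row also appends the running sum so far, which is why 'cumulativeSumId' has one entry per row):

```python
# Hand-derived dictionary contents for the sample frame (illustration only).
# Intervals: ID 1 -> 22:00 - 21:00 = 3600 s; ID 2 -> 899 s and 1800 s.
time_diff_dict = {1: [3600.0], 2: [899.0, 1800.0]}

# Running sums: every "start" appends the sum so far, every "complete"
# appends the updated sum, so the last value per ID is the total.
cumulativeSumId = {1: [0, 3600.0], 2: [0, 899.0, 899.0, 2699.0]}
```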

However, the real df has about 180,000 rows with about 3,000 unique IDs, and the code below takes about 10 minutes to finish. I could probably use iteration methods like the ones described here, or more efficient nested loops; either way, I need some help improving the performance for this particular case. My current code is:

# dictionary containing the time intervals of IDs (delta of complete-start)
# key: ID, value: time interval for every unique complete-start completion per unique ID
time_diff_dict = {key: [] for key in list(df.ID.unique())}

# same structure as time_diff_dict, but here storing the time_diff_dict values per ID
cumulativeSumId = {key: [] for key in list(df.ID.unique())} 

# initialise time difference to 0 before calculating time interval
time_diff = 0

for unique_ID in df.ID.unique():
  for row in df.itertuples(index=True, name="Pandas"):
    if row.ID == unique_ID:
      if row.Status == "start":
        start = row.Time
        cumulativeSumId[unique_ID].append(sum(time_diff_dict[unique_ID]))
      elif row.Status == "complete":
        end = row.Time
        delta = end - start
        time_diff = delta.total_seconds()
        time_diff_dict[unique_ID].append(time_diff)
        cumulativeSumId[unique_ID].append(sum(time_diff_dict[unique_ID]))

dict_values = list(cumulativeSumId.values())
df["Cumulative_Time"] = [item for sublist in dict_values for item in sublist]

The resulting df is now:

                 Time  ID   Event    Status  Cumulative_Time
0 2021-01-10 21:00:00   1     cut     start              0.0
1 2021-01-10 22:00:00   1     cut  complete           3600.0
2 2021-01-10 21:30:01   2  smooth     start              0.0
3 2021-01-10 21:45:00   2  smooth  complete            899.0
4 2021-01-12 09:00:00   2     cut     start            899.0
5 2021-01-12 09:30:00   2     cut  complete           2699.0

where, e.g., for ID==1 the total production time is 3600 s, and for ID==2 it is 2699 s, since this is the last entry in its cumulative-sum dictionary.
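As a side note on performance: the same 'Cumulative_Time' column can be built with no Python-level loops. A vectorized sketch, assuming rows alternate start/complete within each ID in chronological order (as in the sample):

```python
import pandas as pd

df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2021-01-10 21:00:00", "2021-01-10 22:00:00",
        "2021-01-10 21:30:01", "2021-01-10 21:45:00",
        "2021-01-12 09:00:00", "2021-01-12 09:30:00"]),
    "ID": [1, 1, 2, 2, 2, 2],
    "Event": ["cut", "cut", "smooth", "smooth", "cut", "cut"],
    "Status": ["start", "complete", "start", "complete", "start", "complete"],
})

# Pair each "complete" with the previous row of the same ID (its "start"),
# take the difference in seconds, zero it out on "start" rows, then
# accumulate per ID.
prev_time = df.groupby("ID")["Time"].shift()
interval = (df["Time"] - prev_time).dt.total_seconds()
interval = interval.where(df["Status"] == "complete", 0.0)
df["Cumulative_Time"] = interval.groupby(df["ID"]).cumsum()
# df["Cumulative_Time"] -> [0.0, 3600.0, 0.0, 899.0, 899.0, 2699.0]
```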

After that, I create a new df with: unique IDs, "totalTimeId", and "timeIntervals":

''' 
* create list of lists 
* every sublist is a dataframe per unique ID
'''
lists_of_IDdfs =[]

for id, df_id in df.groupby("ID"):
  lists_of_IDdfs.append(df_id)

data = []
for i in range(len(lists_of_IDdfs)):
  data.append((lists_of_IDdfs[i].ID.iloc[-1], lists_of_IDdfs[i].Cumulative_Time.iloc[-1]))
df_ID_TotalTime = pd.DataFrame(data, columns= ["ID", "totalTimeId"])

'''add the respective time interval data points per unique ID'''
df_ID_TotalTime["timeIntervals"] = df_ID_TotalTime["ID"].map(time_diff_dict)
df_ID_TotalTime

Final desired result:

   ID  totalTimeId    timeIntervals
0   1       3600.0         [3600.0]
1   2       2699.0  [899.0, 1800.0]
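The list-of-lists detour above can also be collapsed into a single groupby. A sketch, using sample stand-ins for the 'Cumulative_Time' column and 'time_diff_dict' built earlier:

```python
import pandas as pd

# Minimal stand-ins for the state built earlier (sample values, assumption):
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2, 2],
    "Cumulative_Time": [0.0, 3600.0, 0.0, 899.0, 899.0, 2699.0],
})
time_diff_dict = {1: [3600.0], 2: [899.0, 1800.0]}

# The last cumulative value per ID is its total production time.
df_ID_TotalTime = (df.groupby("ID")["Cumulative_Time"]
                     .last()
                     .rename("totalTimeId")
                     .reset_index())
df_ID_TotalTime["timeIntervals"] = df_ID_TotalTime["ID"].map(time_diff_dict)
```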

I would appreciate any thoughts and help! Thank you!


2 Answers


You can reshape your dataframe using pivot, compute the difference between the two datetimes and groupby "ID" to aggregate the data:

# pre-requisite ensure that Time is of datetime type
df['Time'] = pd.to_datetime(df['Time'])

(df.pivot(index=['ID', 'Event'], columns='Status', values='Time')
   .assign(time=lambda d: d['complete']-d['start'])
   .groupby('ID')['time'].sum()
)

output:

ID
1   0 days 01:00:00
2   0 days 00:44:59

To get the output in seconds:

(df.pivot(index=['ID', 'Event'], columns='Status', values='Time')
   .assign(time=lambda d: d['complete']-d['start'])
   .groupby('ID')['time'].sum()
   .dt.total_seconds()
)

output:

ID
1    3600.0
2    2699.0

alternative output:

(df.pivot(index=['ID', 'Event'], columns='Status', values='Time')
   .assign(time=lambda d: (d['complete']-d['start']).dt.total_seconds())
   .groupby('ID')['time'].agg(totalTimeId='sum', timeIntervals=list)
)

output:

    totalTimeId    timeIntervals
ID                              
1        3600.0         [3600.0]
2        2699.0  [1800.0, 899.0]

edit how to handle duplicates:

You need to add a secondary index that is unique (ID2)

(df.assign(ID2=df.groupby(['ID', 'Event', 'Status']).cumcount())
   .pivot(index=['ID', 'ID2', 'Event'], columns='Status', values='Time')
   .assign(time=lambda d: (d['complete']-d['start']).dt.total_seconds())
   .groupby('ID')['time'].agg(totalTimeId='sum', timeIntervals=list)
)

input:

                 Time  ID   Event    Status
0 2021-01-10 21:00:00   1     cut     start
1 2021-01-10 22:00:00   1     cut  complete
2 2021-01-10 21:30:01   2  smooth     start
3 2021-01-10 21:45:00   2  smooth  complete
4 2021-01-12 09:00:00   2     cut     start
5 2021-01-12 09:30:00   2     cut  complete
6 2021-01-12 09:30:00   2     cut     start
7 2021-01-12 09:35:00   2     cut  complete

intermediate:

Status                   complete               start
ID ID2 Event                                         
1  0   cut    2021-01-10 22:00:00 2021-01-10 21:00:00
2  0   cut    2021-01-12 09:30:00 2021-01-12 09:00:00
       smooth 2021-01-10 21:45:00 2021-01-10 21:30:01
   1   cut    2021-01-12 09:35:00 2021-01-12 09:30:00

output:

    totalTimeId           timeIntervals
ID                                     
1        3600.0                [3600.0]
2        2999.0  [1800.0, 899.0, 300.0]
  • Thanks @mozway. I received an error for reshaping indices which I would investigate, however, please have a look at my post again. I had to re-edit to point out details not captured in the df sample – dimi_fn Sep 08 '21 at 14:29
  • You need to ensure that time is of datetime type: `df['Time'] = pd.to_datetime(df['Time'])`, check the updated answer – mozway Sep 08 '21 at 14:35
  • Done, if you now copy-paste all code snippets you should run them with no problem. – dimi_fn Sep 08 '21 at 14:41
  • @dimi_fn so, it works right? I get `1: 3600.0 / 2: 2699.0` with the updated dataset – mozway Sep 08 '21 at 14:47
  • Thanks a lot @mozway, in this df sample it works perfect! In my original I receive "Index contains duplicate entries, cannot reshape" which I have to look into. However, even for this df sample do you maybe have an idea on how I can reach the end result as shown in the last df_ID_TotalTime df? – dimi_fn Sep 08 '21 at 14:56
  • Thank you @mozway, I accepted the answer because it does solve my problem here! Now as I told you earlier, I'm dealing with the error "Index contains duplicate entries, cannot reshape" in the real df, and still trying to solve it e.g. with solutions like [these](https://stackoverflow.com/questions/28651079/pandas-unstack-problems-valueerror-index-contains-duplicate-entries-cannot-re). I don't expect you to spend more time on this post, but since you made your hands dirty and understood the concept, I would really appreciate if you think of a potential solution please let me know. Thanks again! – dimi_fn Sep 09 '21 at 00:10
  • Can you provide an example of duplicated data and the matching expected output? – mozway Sep 09 '21 at 04:20
  • thanks a lot @mozway, and I appreciate it! Imagine that the value "time" is duplicate across the same ID or IDs, i.e., the "start" time of a time interval is equal with its previous "complete" time value (and not e.g. after 1 second or more). You can open [this shared colab notebook](https://colab.research.google.com/drive/1HYmfeYXL3FkPNIUaT7MZ550nxF_g_K3V?usp=sharing) displaying the same df but with the new indices 6&7, where time on index 5 is equal to that of 6. You can see that your code gives the error I was telling you about. I would appreciate, again, any contribution/thoughts! – dimi_fn Sep 09 '21 at 10:02
  • If I could, I would re-upvote! Thank you so much @mozway, this fixed my issue! (the only detail I would fix in the real df is that the provided list of time-intervals does not give the time interval values in the order of the IDs, I suppose this happens inevitably because of the 'intermediate' stage of ID2, however, I checked that the sum of all intervals is equal to the total time, which was important). Thank you a lot again! – dimi_fn Sep 09 '21 at 11:31
  • @dimi_fn great, glad it worked. If you want the intervals in order of the Events you can use Event first in the groupby. If you really want you can check my other posts and upvote if you find them useful ;) – mozway Sep 09 '21 at 11:35

You can group by ID and then calculate the timedeltas:

df['Cumulative_Time'] = df.groupby('ID')['Time'].apply(lambda x: x - x.min()).dt.total_seconds()

In order to get your desired output you can do the following, inspired by @mozway's answer:

(df.groupby(['ID','Event'])['Time']
 .apply(lambda x: x.max() - x.min()).dt.total_seconds()
 .groupby('ID')
 .agg(totalTimeId='sum', timeIntervals=list))
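Run end-to-end on the question's sample frame, this reproduces the pivot approach's totals (a self-contained check; max − min per (ID, Event) matches the pair differences here because each Event occurs as a single start/complete pair):

```python
import pandas as pd

df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2021-01-10 21:00:00", "2021-01-10 22:00:00",
        "2021-01-10 21:30:01", "2021-01-10 21:45:00",
        "2021-01-12 09:00:00", "2021-01-12 09:30:00"]),
    "ID": [1, 1, 2, 2, 2, 2],
    "Event": ["cut", "cut", "smooth", "smooth", "cut", "cut"],
    "Status": ["start", "complete", "start", "complete", "start", "complete"],
})

# Span per (ID, Event), then aggregate the spans per ID.
res = (df.groupby(['ID', 'Event'])['Time']
         .apply(lambda x: x.max() - x.min()).dt.total_seconds()
         .groupby('ID')
         .agg(totalTimeId='sum', timeIntervals=list))
# res.loc[1] -> totalTimeId 3600.0, timeIntervals [3600.0]
# res.loc[2] -> totalTimeId 2699.0, timeIntervals [1800.0, 899.0]
```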
  • I'm sorry @joAschauer, this worked and thank you, but I should have probably pointed out that I should not capture any potential time breaks between each time interval. My original df did not help to capture that scenario, so I edited the post again! Thanks for your help, I hope you take another look! – dimi_fn Sep 08 '21 at 14:48
  • hi @dimi_fn, I edited the answer accordingly, maybe that can solve your duplicate index issue? – joAschauer Sep 09 '21 at 10:05
  • thanks @joAschauer for having another look! Check [this notebook](https://colab.research.google.com/drive/1mdA7jK--YjSGIUmRVuHffPbWjtZJrrEE?usp=sharing) you can see that cum time takes into account the time breaks, and the len of time intervals is not captured correctly – dimi_fn Sep 09 '21 at 11:35