0

Let's suppose we have this dataframe called 'chosen_data' that contains start & end dates & duration of available slots in which we can perform a task (I shall provide a sample so you can reproduce it if you wish). Let's fix the task duration to x = 24h, if my first available slot's duration is 100hours I want to subdivide it into 100-24=76 portions where each one's duration is equal to 'x'. I want to store these portions into a new dataframe (called result) containing start & end dates of each subdivision.

The actual script does the job however it takes forever to run when I apply it on bigger Datasets, I just need to replace the for-loops :

import pandas as pd
import numpy as np
data = [['3/6/1999 5:00','3/8/1999 0:00',43],['3/8/1999 22:00','3/12/1999 19:00',93],['3/12/1999  22:00','3/21/1999 20:00',214],['3/22/1999 19:00','3/26/1999 3:00',80]]
chosen_data = pd.DataFrame(data, columns = ['Start Date', 'End Date', 'Duration [Hours]'])
chosen_data[['Start Date','End Date']] = chosen_data[['Start Date','End Date']].apply(pd.to_datetime)

result = pd.DataFrame(columns=['Start Date','End Date', 'Duration [Hours]'])
x = 24 # Task duration(normaly it's a user input variable)
for i in chosen_data.index : 
    fenetre = chosen_data.copy()
    num_subdiv = fenetre['Duration [Hours]'].loc[i] - x    
    for j in range(int(num_subdiv)) :        
        fenetre['Start Date'].loc[i] = fenetre['Start Date'].loc[i] + pd.to_timedelta(1, unit ='h')
        #fenetre['Start Date'].loc[i] = fenetre['Start Date'].loc[i] + pd.to_timedelta(j, unit ='h')
        fenetre['End Date'].loc[i] = fenetre['Start Date'].loc[i] + pd.to_timedelta(x , unit ='h')
        #fenetre['End Date'].loc[i] = fenetre['End Date'].loc[i] + pd.to_timedelta(j - substract ,   unit ='h')      
        result = result.append(fenetre.loc[i] , ignore_index= True)
        
result['Duration [Hours]'] = (result['End Date'] - result['Start Date']) / pd.to_timedelta(1, unit = 'h')

I'm looking for a better way to achieve this purpose, here's the desired output :

enter image description here

Keyser Soze
  • 262
  • 3
  • 11
  • `result['Duration [Hours]'] = pd.to_timedelta(x , unit ='h')` does same at the end – azro May 05 '21 at 11:39
  • That's true, or simply result['duration'] = x. However, I used that formula to check whether my output is correct. It helped me debug – Keyser Soze May 05 '21 at 11:46
  • You know that this line of code executes in microseconds right ? and it's outside the for-loop so that's not the origin of my execution time problem neither the answer to this post ... – Keyser Soze May 05 '21 at 11:47

1 Answers1

1

Using Explode value to multiple rows :

x = 24
chosen_data['values'] = (chosen_data['Duration [Hours]'] - x).apply(lambda a: list(range(1, 1 + a)))
result: pd.DataFrame = pd.DataFrame(chosen_data['values'].tolist(), index=chosen_data['Start Date']) \
    .stack().reset_index()[["Start Date", 0]]

result.columns = ["Start Date", "ToAdd"]
result['Start Date'] = result['Start Date'] + pd.to_timedelta(result['ToAdd'], unit='h')
result['End Date'] = result['Start Date'] + pd.to_timedelta(x, unit='h')
result['Duration'] = x

Printing choosen_data gives, then it explodes for each value of values column

Start Date End Date Duration [Hours] values
0 1999-03-06 05:00:00 1999-03-08 00:00:00 27 [1, 2, 3]
1 1999-03-08 22:00:00 1999-03-12 19:00:00 26 [1, 2]
2 1999-03-12 22:00:00 1999-03-21 20:00:00 28 [1, 2, 3, 4]
3 1999-03-22 19:00:00 1999-03-26 03:00:00 26 [1, 2]
azro
  • 53,056
  • 7
  • 34
  • 70
  • Now that is smooth my man ! Thanks buddy, I never came across the 'Stack()' & 'tolist()' methods so this solution enlightened me & shall be useful for my project's continuity. – Keyser Soze May 05 '21 at 12:00