I have df called data of project records that looks something like this
project = ['Project 1','Project 1','Project 1', 'Project 2', 'Project 2', 'Project 2', 'Project 2', 'Project 2', 'Project 2', 'Project 3', 'Project 3', 'Project 3', 'Project 3', 'Project 3']
progress = [33.3, 66.6, 100.0, 0, 16.6, 33.3, 50.0, 66.6, 83.3, 4.76, 9.52, 14.29, 95.2, 100.0]
hours = [4, 2, 6, 0, 1, 3, 2, 0, 4, 1, 2, 3, 1, 3]
taskcount = [1, 0, 5, 0, 4, 1, 3, 0, 1, 2, 1, 2, 2, 2]
data = pd.DataFrame({'Project':project, 'Progress':progress, 'Hours':hours, 'TaskCount':taskcount})
The column Progress denotes different progress points for the project, expressed as a percent of project completion. For instance, the first record for Project 3 corresponds to the Progress value 4.76, which means that Project 3 was 4.76% complete when that particular record was logged.
The entries in the columns Hours and TaskCount show the number of hours worked on the particular project and a count of the number of tasks completed for that project, respectively, between the progress value given in the column Progress and the preceding progress value. For instance, the third record for Project 2 indicates that 3 hours were worked and 1 task completed at the progress value 33.3 (and thus in the interval (16.6, 33.3]), which means that in the time Project 2 was > 16.6% complete and <= 33.3% complete, 3 hours were worked and 1 task completed.
For each project in the df data, I need to insert rows in the df data with the missing decile-valued progress values - so 0.0%, 10.0%, 20.0%, ..., 100.0% - where each progress value implicitly corresponds to a percent interval - so (0.0%, 10.0%], (10.0%, 20.0%], ..., (90.0%, 100.0%].
If a row of the df data already has integer decile-value in the Progress column, then there is no need to insert a row for that particular value. For instance, the fifth record for Project 3 corresponds to the progress value 100.0, so there is no need to insert a new row at that progress point for Project 3.
Rows in the original df data should be binned into the row of the output df, called data_output, corresponding to the appropriate percent interval (0.0%, 10.0%], (10.0%, 20.0%], ..., (90.0%, 100.0%], with the entries in the columns Hours and TaskCount summed between the rows. For instance, in the df data, the first two records of Project 3 correspond to the Progress values 4.76 and 9.52, respectively. As 4.76% and 9.52% are in the interval (0.0%, 10.0%], the rows should be combined into a new row with Progress value 10.0, Hours value 1 + 2 = 3, and TaskCount value 2 + 1 = 3.
The output that I'm looking for is given by the df data_output:
project_output = ['Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1', 'Project 2', 'Project 2', 'Project 2', 'Project 2', 'Project 2', 'Project 2','Project 2', 'Project 2', 'Project 2', 'Project 2', 'Project 2','Project 3', 'Project 3', 'Project 3', 'Project 3', 'Project 3','Project 3', 'Project 3', 'Project 3', 'Project 3', 'Project 3','Project 3']
progress_output = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
hours_output = [0, 0, 0, 0, 4, 0, 0, 2, 0, 0, 6, 0, 0, 1, 0, 3, 2, 0, 0, 0, 4, 0, 0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 4]
taskcount_output = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 5, 0, 0, 4, 0, 1, 3, 0, 0, 0, 1, 0, 0, 3, 2, 0, 0, 0, 0, 0, 0, 0, 4]
data_output = pd.DataFrame({'Project':project_output, 'Progress':progress_output, 'Hours':hours_output, 'TaskCount':taskcount_output})
I should also note that the real dfs that I'm working with are very large - they comprise about 278,000 rows - so I'm hoping to find an efficient solution. I was thinking about creating a sequence of bins using
bins = np.arange(0,101,10).astype(float)
creating a column Binned with
data['Binned'] = pd.cut(data['Progress'], bins = bins, labels = labels)
data['Binned'] = pd.factorize(data['Binned'])[0]
summing the values in the Hours and TaskCount columns for each percent interval bin for each project with,
data = data.groupby(['Project', pd.Grouper(key='Binned')])['Hours','TaskCount'].sum()
and finally renaming the column Binned to Progress, but I don't know how to integrate this with the method I was introduced to here (in the first answer to this post).