Binning by Percent Progress Intervals by Project in pandas

Question

I have df called data of project records that looks something like this

project = ['Project 1','Project 1','Project 1', 'Project 2', 'Project 2', 'Project 2', 'Project 2', 'Project 2', 'Project 2', 'Project 3', 'Project 3', 'Project 3', 'Project 3', 'Project 3'] 
progress = [33.3, 66.6, 100.0, 0, 16.6, 33.3, 50.0, 66.6, 83.3, 4.76, 9.52, 14.29, 95.2, 100.0]
hours = [4, 2, 6, 0, 1, 3, 2, 0, 4, 1, 2, 3, 1, 3]
taskcount = [1, 0, 5, 0, 4, 1, 3, 0, 1, 2, 1, 2, 2, 2]

data = pd.DataFrame({'Project':project, 'Progress':progress, 'Hours':hours, 'TaskCount':taskcount})

The column Progress denotes different progress points for the project, expressed as a percent of project completion. For instance, the first record for Project 3 corresponds to the Progress value 4.76, which means that Project 3 was 4.76% complete when that particular record was logged.

The entries in the columns Hours and TaskCount show the number of hours worked on the particular project and a count of the number of tasks completed for that project, respectively, between the progress value given in the column Progress and the preceding progress value. For instance, the third record for Project 2 indicates that 3 hours were worked and 1 task completed at the progress value 33.3 (and thus in the interval (16.6, 33.3]), which means that in the time Project 2 was > 16.6% complete and <= 33.3% complete, 3 hours were worked and 1 task completed.

For each project in the df data, I need to insert rows in the df data with the missing decile-valued progress values - so 0.0%, 10.0%, 20.0%, ..., 100.0% - where each progress value implicitly corresponds to a percent interval - so (0.0%, 10.0%], (10.0%, 20.0%], ..., (90.0%, 100.0%].

If a row of the df data already has integer decile-value in the Progress column, then there is no need to insert a row for that particular value. For instance, the fifth record for Project 3 corresponds to the progress value 100.0, so there is no need to insert a new row at that progress point for Project 3.

Rows in the original df data should be binned into the row of the output df, called data_output, corresponding to the appropriate percent interval (0.0%, 10.0%], (10.0%, 20.0%], ..., (90.0%, 100.0%], with the entries in the columns Hours and TaskCount summed between the rows. For instance, in the df data, the first two records of Project 3 correspond to the Progress values 4.76 and 9.52, respectively. As 4.76% and 9.52% are in the interval (0.0%, 10.0%], the rows should be combined into a new row with Progress value 10.0, Hours value 1 + 2 = 3, and TaskCount value 2 + 1 = 3.

The output that I'm looking for is given by the df data_output:

project_output = ['Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1', 'Project 2', 'Project 2', 'Project 2', 'Project 2', 'Project 2', 'Project 2','Project 2', 'Project 2', 'Project 2', 'Project 2', 'Project 2','Project 3', 'Project 3', 'Project 3', 'Project 3', 'Project 3','Project 3', 'Project 3', 'Project 3', 'Project 3', 'Project 3','Project 3']
progress_output = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
hours_output = [0, 0, 0, 0, 4, 0, 0, 2, 0, 0, 6, 0, 0, 1, 0, 3, 2, 0, 0, 0, 4, 0, 0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 4]
taskcount_output = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 5, 0, 0, 4, 0, 1, 3, 0, 0, 0, 1, 0, 0, 3, 2, 0, 0, 0, 0, 0, 0, 0, 4]

data_output = pd.DataFrame({'Project':project_output, 'Progress':progress_output, 'Hours':hours_output, 'TaskCount':taskcount_output})

I should also note that the real dfs that I'm working with are very large - they comprise about 278,000 rows - so I'm hoping to find an efficient solution. I was thinking about creating a sequence of bins using

bins = np.arange(0,101,10).astype(float)

creating a column Binned with

data['Binned'] = pd.cut(data['Progress'], bins = bins, labels = labels)
data['Binned'] = pd.factorize(data['Binned'])[0]

summing the values in the Hours and TaskCount columns for each percent interval bin for each project with,

data = data.groupby(['Project', pd.Grouper(key='Binned')])['Hours','TaskCount'].sum()

and finally renaming the column Binned to Progress, but I don't know how to integrate this with the method I was introduced to here (in the first answer to this post).

Serge de Gosson de Varennes · Answer 1 · 2023-02-08T16:27:06.540

I think the best way of doing this is the following.

import pandas as pd

def create_missing_decile_values(df):
    progress_decile_values = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    data_output = pd.DataFrame(columns=['Project', 'Progress', 'Hours', 'TaskCount'])
    unique_projects = df['Project'].unique()
    for project in unique_projects:
        project_df = df[df['Project'] == project].copy()
        project_df['Progress'] = (project_df['Progress'] // 10 * 10).astype(int)
        project_df = project_df.groupby('Progress').agg({'Hours': 'sum', 'TaskCount': 'sum'})
        project_df = project_df.reindex(progress_decile_values, fill_value=0)
        project_df['Project'] = project
        project_df = project_df.reset_index().rename(columns={'index': 'Progress'})
        data_output = data_output.append(project_df)
    return data_output

data_output = create_missing_decile_values(data)

which returns:

      Project Progress Hours TaskCount
0   Project 1        0     0         0
1   Project 1       10     0         0
2   Project 1       20     0         0
3   Project 1       30     4         1
4   Project 1       40     0         0
5   Project 1       50     0         0
6   Project 1       60     2         0
7   Project 1       70     0         0
8   Project 1       80     0         0
9   Project 1       90     0         0
10  Project 1      100     6         5
0   Project 2        0     0         0
1   Project 2       10     1         4
2   Project 2       20     0         0
3   Project 2       30     3         1
4   Project 2       40     0         0
5   Project 2       50     2         3
6   Project 2       60     0         0
7   Project 2       70     0         0
8   Project 2       80     4         1
9   Project 2       90     0         0
10  Project 2      100     0         0
0   Project 3        0     3         3
1   Project 3       10     3         2
2   Project 3       20     0         0
3   Project 3       30     0         0
4   Project 3       40     0         0
5   Project 3       50     0         0
6   Project 3       60     0         0
7   Project 3       70     0         0
8   Project 3       80     0         0
9   Project 3       90     1         2
10  Project 3      100     3         2

Note: I think you are mistanken on the expected ouput for project 3.

Thanks, @SergedeGossondeVarennes. Your solution works with my real data set, and it runs significantly faster than my original solution (something like 130 sec vs 300 sec), but it still doesn't run quite as fast as I was hoping for. Is there any way to implement a solution without looping through distinct projects? — blargen, Feb 08 '23 at 15:47
I'm thinking of the first answer given to this question: https://stackoverflow.com/questions/75365789/inserting-rows-with-consecutive-dates-for-different-projects-with-different-star — blargen, Feb 08 '23 at 16:03
@blargen There is a huge difference between what you what to do and the answer you are referring to. In that case, they explode the dataframe on dates. You do not have tht luxury. I'v tried to circumvent loops but I always en up in that. Wait a while. Maybe someone will come up with a bright idea. — Serge de Gosson de Varennes, Feb 08 '23 at 16:37

Binning by Percent Progress Intervals by Project in pandas

1 Answers1