Is there a way to groupby sum a pandas dataframe in a sequence?

Question

I have a dataframe that looks like this:

      emp    job phase   cat  hours equipnum equipcode  equiphours   equipdate
0  OO003  19713   95L  9512      1     None      None         0.0  2020-01-24
1  OO003  19713   95L  9512      1     None      None         0.0  2020-01-24
2  OO003  19713   95L  9512      1     None      None         0.0  2020-01-24
3  OO003  19713   95L  9512      1     None      None         0.0  2020-01-24
4  OO003  19526   OH   MAT       1   AIR012     E-REV         1.0  2020-01-24
5  OO003  19526   OH   MAT       1   AIR012     E-REV         1.0  2020-01-24
6  OO003  19526   OH   MAT       1   AIR012     E-REV         1.0  2020-01-24
7  OO003  19486   52L  5212      1     None      None         0.0  2020-01-24
8  OO003  19486   52L  5212      1     None      None         0.0  2020-01-24
9  OO003  19486   52L  5212      1     None      None         0.0  2020-01-24
10 UR003  19713   95L  9512      1     None      None         0.0  2020-01-24
11 UR003  19713   95L  9512      1     None      None         0.0  2020-01-24
12 UR003  19713   95L  9512      1     None      None         0.0  2020-01-24
13 UR003  19526   OH   MAT       1     None      None         0.0  2020-01-24
14 UR003  19526   OH   MAT       1     None      None         0.0  2020-01-24
15 UR003  19526   OH   MAT       1     None      None         0.0  2020-01-24
16 UR003  19526   OH   MAT       1     None      None         0.0  2020-01-24
17 UR003  19526   OH   MAT       1     None      None         0.0  2020-01-24
18 UR003  19526   OH   MAT       1     None      None         0.0  2020-01-24
19 UR003  19526   OH   MAT       1     None      None         0.0  2020-01-24

Would there be a way to groupby sum only the hours column for the first 8 rows and then the following 2 rows for each unique employee number (emp)?

The final dataframe should look like this:


     emp    job phase   cat  hours equipnum equipcode  equiphours   equipdate
0   OO003  19713   95L  9512      4     None      None         0.0  2020-01-24
1   OO003  19526   OH   MAT       3   AIR012     E-REV         1.0  2020-01-24
2   OO003  19486   52L  5212      1     None      None         0.0  2020-01-24
3   OO003  19486   52L  5212      2     None      None         0.0  2020-01-24
4   UR003  19713   95L  9512      3     None      None         0.0  2020-01-24
5   UR003  19526    OH   MAT      5     None      None         0.0  2020-01-24
6   UR003  19526    OH   MAT      2     None      None         0.0  2020-01-24

Thank you for the help!

paste your data in the body of your question, don't link an image - we can't copy and paste your sample data for testing if it's an image. — gbeaven, Jan 27 '20 at 19:10
See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — alkasm, Jan 27 '20 at 19:13
Also, it would help to have a better description of what you're trying to accomplish. The given group/sum rules would not map the given input to the provided output (sums don't match anything in the output) — G. Anderson, Jan 27 '20 at 19:15
Sorry for the confusion. I fixed the data in the question. I am trying to calculate overtime for the 2 employees above (OO003 and UR003). They each worked 10 hours on 1/24/20 on multiple job numbers. I would like to find a way to create the second dataframe on the bottom, which is to first calculate regular hours (8) and the remaining (2). Apologies for the confusion. — BPCAL, Jan 27 '20 at 19:57

ALollz · Accepted Answer · 2020-01-27T20:49:35.210

You need 2 groupbys. The first creates the cumulative number of hours worked within employee. Then you groupby the employee, job, and whether the cumulative number of hours worked is <= 8. Aggregate the columns accordingly.

s = df.groupby('emp').hours.cumsum()
#s = df.groupby('emp').cumcount()+1 # If truly rows, not hours

# `first` for everything but hours and group keys. `sum` for hours
agg_d = {x: 'first' for x in df.columns.difference(['hours', 'job', 'emp'])}
agg_d['hours'] = 'sum'

res = (df.groupby(['job', 'emp', s.le(8).rename('drop')], sort=False)
         .agg(agg_d)
         .reset_index()
         .drop(columns='drop'))

print(res)
     job    emp   cat equipcode   equipdate  equiphours equipnum phase  hours
0  19713  OO003  9512      None  2020-01-24         0.0     None   95L      4
1  19526  OO003   MAT     E-REV  2020-01-24         1.0   AIR012    OH      3
2  19486  OO003  5212      None  2020-01-24         0.0     None   52L      1
3  19486  OO003  5212      None  2020-01-24         0.0     None   52L      2
4  19713  UR003  9512      None  2020-01-24         0.0     None   95L      3
5  19526  UR003   MAT      None  2020-01-24         0.0     None    OH      5
6  19526  UR003   MAT      None  2020-01-24         0.0     None    OH      2

Thank you so much for the quick answer. I forgot to ask one more thing. Is it possible to assign a type number for reg/ot where regular hours is 1 and overtime hours is 2? — BPCAL, Jan 27 '20 at 20:56
@BPCAL Sure. In the second part you can groupby `s.le(8).rename('type').map({True: 1, False: 2})` instead of my current `s` and then remove the `.drop(columns='drop`)` part. Basically I map True/False to your two `type`s and then since you want that label we no longer need to drop it in the end. — ALollz, Jan 27 '20 at 21:01
Wow thank you very much for the help! Very much appreciated! — BPCAL, Jan 27 '20 at 21:05

Is there a way to groupby sum a pandas dataframe in a sequence?

1 Answers1