Let's assume I have a pandas dataframe like this:
import pandas as pd
d = {'entrance': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'd', 'd', 'e', 'e', 'e', 'e'],
'hour': [3, 4, 4, 4, 2, 2, 2, 2, 0, 0, 0, 1, 12, 12, 12, 12, 7, 7, 8, 8],
'in': [3, 2, 0, 4, 3, 2, 1, 0, 2, 1, 5, 0, 1, 1, 2, 3, 10, 10, 1, 5],
'out': [0, 1, 0, 0, 1, 2, 1, 3, 1, 1, 0, 2, 0, 2, 3, 0, 0, 0, 25, 0]}
df = pd.DataFrame(data=d)
df
entrance hour in out
a 3 3 0
a 4 2 1
a 4 0 0
a 4 4 0
b 2 3 1
b 2 2 2
b 2 1 1
b 2 0 3
c 0 2 1
c 0 1 1
c 0 5 0
c 1 0 2
d 12 1 0
d 12 1 2
d 12 2 3
d 12 3 0
e 7 10 0
e 7 10 0
e 8 1 25
e 8 5 0
I would like to sum the values in the columns in
and out
grouped by hour
. I know I can achieve this with:
df_grouped = df.groupby(['hour']).agg({'in':'sum', 'out': 'sum'}).reset_index()
However, the values in the columns in
and out
should only be grouped by the first hour
record of each entrance
group. So for entrance group 'a', the values in
and out
should be summed for the first record of the corresponding hours, which is in this case hour '3'. The desired output should be:
hour in out
3 9 1
2 6 7
0 8 4
12 7 5
7 27 25
What is the best way to do this?