1

I have a Dataframe where I want to groupby a couple conditions and then sum a few fields. When I sum the values I want to flatten the values

With an input of:

UserID    Site    First Name    Last Name    May Hours    June Hours     July Hours
6385         A          John          Doe            8             9             10
6385         A          John          Doe            3             7              8
4756         A          Mark         Mean            7             6              5
4756         B          Mark         Mean            3             2              3

and grouping by UserID and Site, I want an output of:

UserID    Site    First Name    Last Name    May Hours    June Hours     July Hours
6385         A          John          Doe           11            16             18
4756         A          Mark         Mean            7             6              5
4756         B          Mark         Mean            3             2              3

It can be assumed that besides the UserID, Site, and the Monthly Hours, all other column values are the same.

My current idea was to do:

for group_name, group_df in df.groupby(by=['UserID','Site']):
    if len(group_df) > 1: # We have rows that need to be summed
        sums = group_df[['May Hours','June Hours','July Hours']].sum()
        # Assign this sum to the first row in the grouped_df
        # Remove all following rows in grouped_df (we already got their sum)
        # Append this to a new dataframe
    else:
        # Append this to a new dataframe

While this solution may work, it is definitely not efficient but I am unsure of the best way to flatten the dataframe

Bijan
  • 7,737
  • 18
  • 89
  • 149
  • 2
    You can do, ```df.groupby(['UserID', 'Site', 'First Name', 'Last Name'], as_index=False).sum()``` – sushanth Jul 24 '20 at 04:00

1 Answers1

2

When I have situations like this I often pivot and flatten the data frame before. A few key reasons are that it keeps things in a data frame and allows me to work with the index for things like joins. Nothing against groupby, though, just prefer pivot_table.

df = pd.pivot_table(
    index=df.columns[:4],
    values=df.columns[4:],
    aggfunc=np.sum
)

df = pd.DataFrame(df.to_records())
zerecees
  • 697
  • 4
  • 13