0

Similar to this previous post, I would like to derive the percentage within each group but based on the sum of multiple columns and add subtotals. For example given the dataframe below:

import numpy as np
import pandas as pd

np.random.seed(0)

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)],
               'sales2': [np.random.randint(100000, 999999) for _ in range(12)],
               'sales3': [np.random.randint(100000, 999999) for _ in range(12)]})

The ideal results would yield:

enter image description here

Update

It would be ideal to groupby both state and office id for situations where there are repeating values for office id column. Here is an example:

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': [1,1,1,2,2,2] * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)],
               'sales2': [np.random.randint(100000, 999999) for _ in range(12)],
               'sales3': [np.random.randint(100000, 999999) for _ in range(12)]})

This should then yield:

enter image description here

GeoPy
  • 1,556
  • 3
  • 17
  • 21
  • I'm a little later, but please see my answer. I've actually been working on it since you posted this question :) –  Dec 31 '21 at 18:42

2 Answers2

1

We need create you need step by step, include groupby with append the subtotal per group on column , then transform the total sum with state

s = df.groupby('state')[['sales','sales2','sales3']].sum().assign(office_id = 'Subtotal').set_index('office_id',append=True)
out = pd.concat([df,s.reset_index()]).sort_values('state')
out['Subtotal'] = out[['sales','sales2','sales3']].sum(axis=1)
v = out.groupby('state')['Subtotal'].transform('sum')/2
out.update(out[['sales','sales2','sales3','Subtotal']].div(v,axis=0))
out
   state office_id     sales    sales2    sales3  Subtotal
3     AZ         4  0.047124  0.175385  0.118068  0.340578
7     AZ         2  0.041571  0.087926  0.087902  0.217399
11    AZ         6  0.156107  0.131998  0.153919  0.442023
0     AZ  Subtotal  0.244802  0.395309  0.359889  1.000000
0     CA         1  0.062026  0.127860  0.145870  0.335756
4     CA         5  0.150188  0.107702  0.068203  0.326092
8     CA         3  0.108636  0.129193  0.100323  0.338152
1     CA  Subtotal  0.320849  0.364755  0.314396  1.000000
2     CO         3  0.058604  0.072756  0.142734  0.274095
6     CO         1  0.108667  0.208210  0.145513  0.462390
10    CO         5  0.127604  0.095630  0.040282  0.263516
2     CO  Subtotal  0.294875  0.376596  0.328529  1.000000
1     WA         2  0.106233  0.081434  0.085797  0.273463
5     WA         6  0.091156  0.127159  0.138270  0.356585
9     WA         4  0.108694  0.195807  0.065451  0.369952
3     WA  Subtotal  0.306083  0.404399  0.289518  1.000000
BENY
  • 317,841
  • 20
  • 164
  • 234
  • This code works great for the given example, however, how can this be implemented if you want to groupby state and office id for situations where office id has multiple repeating values? Ive updated the question if that helps. – GeoPy Dec 31 '21 at 18:23
  • 1
    For anyone who wonders I simply added another groupby function at the end which seems to give the right result - out = out.groupby(['state','office_id']).sum(). However, I am still open to other suggestions if there is a more efficent method. – GeoPy Dec 31 '21 at 18:43
1

Here's a one-liner (except for the extra cols variable to shorten the code overall):

cols = df.filter(like='sales').columns

new_df = df.assign(**{k:list(v.values()) for k,v in df[cols].apply(lambda c:c/c.groupby(df['state']).transform(sum)).to_dict().items()}).groupby('state').apply(lambda x:x.append(pd.Series({'office_id':'subtotal',**x.sum().to_dict()},name=''))).droplevel(1).drop('state',axis=1).reset_index()

Output:

>>> new_df
   state office_id     sales    sales2    sales3
0     AZ         4  0.192500  0.443666  0.328069
1     AZ         2  0.169814  0.222423  0.244248
2     AZ         6  0.637686  0.333910  0.427683
3     AZ  subtotal  0.964236  0.636485  1.399280
4     CA         1  0.193319  0.350536  0.463970
5     CA         5  0.468094  0.295272  0.216932
6     CA         3  0.338587  0.354192  0.319098
7     CA  subtotal  1.007825  0.980298  1.011877
8     CO         3  0.198743  0.193194  0.434464
9     CO         1  0.368519  0.552873  0.442923
10    CO         5  0.432739  0.253933  0.122613
11    CO  subtotal  0.826401  1.364314  0.809285
12    WA         2  0.347072  0.201369  0.296343
13    WA         6  0.297815  0.314438  0.477587
14    WA         4  0.355113  0.484192  0.226070
15    WA  subtotal  0.844785  1.089840  1.065375

It's gnarly, so I'll provide a breakdown:

new_df = (df
    # This assign call selects all the sales columns, computes the percentages, and assigns them back to the dataframe
    .assign(**{
        k: list(v.values())
        for k,v
        in df[cols].apply(
            lambda c: c / c.groupby(df['state']).transform(sum)
        )
        .to_dict()
        .items()
    })
    .groupby('state')
    .apply(lambda x: x.append(
        pd.Series(
            {
                'office_id': 'subtotal',
                **(x[cols]
                    .sum(1)
                    .to_dict()
                )
            },
            name=''
        )
    ))
    .droplevel(1)
    .drop('state', axis=1)
    .reset_index()
)