
I have a DataFrame of the following form:

>>> import numpy as np
>>> import pandas as pd
>>> sales = pd.DataFrame({'seller_id':list('AAAABBBB'),'buyer_id':list('CCDECDEF'),
...                       'amount':np.random.randint(10,20,size=(8,))})
>>> sales = sales[['seller_id','buyer_id','amount']]
>>> sales
  seller_id buyer_id  amount
0         A        C      18
1         A        C      15
2         A        D      11
3         A        E      12
4         B        C      16
5         B        D      18
6         B        E      16
7         B        F      19

Now, for each seller, I would like to calculate the share of its total sale amount that is taken up by its largest buyer. I have code that does this, but I have to keep resetting the index and grouping again, which is wasteful. There has to be a better way. I would like a solution where I can aggregate one index level at a time while keeping the others grouped. Here's my current code:

>>> gr2 = sales.groupby(['buyer_id','seller_id'])
>>> seller_buyer_level = gr2['amount'].sum() # sum over different purchases
>>> seller_buyer_level_reset = seller_buyer_level.reset_index('buyer_id') # move buyer_id back to a column
>>> gr3 = seller_buyer_level_reset.groupby(seller_buyer_level_reset.index) # regroup by the remaining index (seller_id)
>>> result = gr3['amount'].max() / gr3['amount'].sum() # largest buyer's share of each seller's total

>>> result
seller_id
A    0.589286
B    0.275362

I simplified a bit. In reality I also have a time period column, so I want to do this at the seller and time period level; that's why in gr3 I'm grouping by the multi-index (in this example it appears as a single index). I thought there would be a solution where, instead of reducing and regrouping, I could aggregate only one index level out of the group, leaving the others grouped, but I couldn't find it in the documentation or online. Any ideas?
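For concreteness, here is roughly what the multi-level version looks like. This is only a sketch: the period column and its values are made up for illustration.

>>> sales2 = sales.assign(period=list('XYXYXYXY'))       # hypothetical time period column
>>> gr2 = sales2.groupby(['buyer_id','seller_id','period'])
>>> level_sum = gr2['amount'].sum()                      # (buyer_id, seller_id, period) -> total amount
>>> level_sum_reset = level_sum.reset_index('buyer_id')  # the reset step I'd like to avoid
>>> gr3 = level_sum_reset.groupby(level=['seller_id','period'])
>>> result = gr3['amount'].max() / gr3['amount'].sum()   # top buyer's share per (seller, period)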

ErnestScribbler

2 Answers


Here's a one-liner, but it resets the index once, too:

sales.groupby(['seller_id','buyer_id']).sum().\
    reset_index(level=1).groupby(level=0).\
    apply(lambda x: x.amount.max()/x.amount.sum())
#seller_id
#A    0.589286
#B    0.275362
#dtype: float64
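
If even that one reset is undesirable, the same idea can be written by summing to a Series first and grouping it by index level; this is just a sketch of the variant, not benchmarked:

sales.groupby(['seller_id','buyer_id'])['amount'].sum().\
    groupby(level=0).\
    apply(lambda s: s.max()/s.sum())
#seller_id
#A    0.589286
#B    0.275362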
DYZ

I would do this using pivot_table and then broadcasting (see What does the term "broadcasting" mean in Pandas documentation?).

First, pivot the data with seller_id in the index and buyer_id in the columns:

sales_pivot = sales.pivot_table(index='seller_id', columns='buyer_id', values='amount', aggfunc='sum')

Then, divide the values in each row by the sum of said row:

result = sales_pivot.div(sales_pivot.sum(axis=1), axis=0)

Lastly, you can call result.max(axis=1) to see the top share for each seller.
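
Putting the steps together, a quick sketch; the numbers below are what the example data in the question gives:

sales_pivot = sales.pivot_table(index='seller_id', columns='buyer_id', values='amount', aggfunc='sum')
result = sales_pivot.div(sales_pivot.sum(axis=1), axis=0)
result.max(axis=1)
#seller_id
#A    0.589286
#B    0.275362
#dtype: float64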

cmaher
  • Thanks! But this would create a table of size n_buyers * n_sellers, which might work in this toy example but would never fit in memory in my real dataset. – ErnestScribbler Feb 15 '17 at 08:29