2

Newbie trying to break my addiction to excel. I have a data set of paid invoices with the vendor and country where it was paid along with the amount. I want know for each vendor, which country they have the greatest invoice amount and what percentage of their total business is in that country. Using this data set I want the result to be:

Desired output

import pandas as pd
import numpy as np
df = pd.DataFrame({'Company' : ['bar','foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'bar'],
    'Country' : ['two','one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'one'],
    'Amount' : [4, 2, 2, 6, 4, 5, 6, 7, 8, 9],
    'Pct' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]})
CoCntry = df.groupby(['Company', 'Country'])
CoCntry.aggregate(np.sum)

After looking at multiple examples including: Extract row with max value and Getting max value using groupby

2: Python : Getting the Row which has the max value in groups using groupby I've gotten as far as creating a DataFrameGroupBy summarizing the invoice data by country. I’m struggling with how to find the max row. After which I must figure out how to calculate the percent. Advice welcome.

Community
  • 1
  • 1
jones5322
  • 23
  • 3

2 Answers2

2

You can use transform for return Series Pct of summed values per groups by first level Company. Then filter Dataframe by max value per groups with idxmax and last divide Amount column with Series Pct:

g = CoCntry.groupby(level='Company')['Amount']
Pct = g.transform('sum')
print (Pct)
Company  Country
bar      one        25
         three      25
         two        25
foo      one        28
         three      28
         two        28
Name: Amount, dtype: int64

CoCntry  = CoCntry.loc[g.idxmax()]
print (CoCntry)
                 Amount  Pct
Company Country             
bar     one          11    0
foo     two          11    0

CoCntry.Pct = CoCntry.Amount.div(Pct)
print (CoCntry.reset_index())
  Company Country  Amount       Pct
0     bar     one      11  0.440000
1     foo     two      11  0.392857

Similar another solution:

CoCntry = df.groupby(['Company', 'Country']).Amount.sum()
print (CoCntry)
Company  Country
bar      one        11
         three       4
         two        10
foo      one         9
         three       8
         two        11
Name: Amount, dtype: int64

g =  CoCntry.groupby(level='Company')
Pct = g.sum()
print (Pct)
Company
bar    25
foo    28
Name: Amount, dtype: int64

maxCoCntry  = CoCntry.loc[g.idxmax()].to_frame()
maxCoCntry['Pct'] = maxCoCntry.Amount.div(Pct, level=0)
print (maxCoCntry.reset_index())

  Company Country  Amount       Pct
0     bar     one      11  0.440000
1     foo     two      11  0.392857
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I'm not sure why, but the statement "g = CoCntry.groupby(level='Company')['Amount']" triggers the error, "ttributeError: Cannot access callable attribute 'groupby' of 'DataFrameGroupBy' objects, try using the 'apply' method" – jones5322 Jan 22 '17 at 22:49
  • The second solution works great. Thank you so much. – jones5322 Jan 22 '17 at 22:50
  • @AlbertJones - I am not sure what is problem, maybe need upgrade pandas - in `0.19.2` it works perfect. – jezrael Jan 22 '17 at 22:52
  • The second solution is what I used. I did not realize it, but a few of my records had $0 invoice amounts and your second solution was able to handle that. – jones5322 Jan 23 '17 at 00:37
2

setup

df = pd.DataFrame({'Company' : ['bar','foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'bar'],
    'Country' : ['two','one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'one'],
    'Amount' : [4, 2, 2, 6, 4, 5, 6, 7, 8, 9],
    })

solution

# sum total invoice per country per company
comp_by_country = df.groupby(['Company', 'Country']).Amount.sum()

# sum total invoice per company
comp_totals = df.groupby('Company').Amount.sum()

# percent of per company per country invoice relative to company
comp_by_country_pct = comp_by_country.div(comp_totals).rename('Pct')

answer to OP question
Which 'Country' has greatest total invoice for 'Company' and what percentage of that companies total business.

comp_by_country_pct.loc[
    comp_by_country_pct.groupby(level=0).idxmax()
].reset_index()

  Company Country       Pct
0     bar     one  0.440000
1     foo     two  0.392857

piRSquared
  • 285,575
  • 57
  • 475
  • 624