1

This is a sample what my dataframe looks like:

company_name country_code state_code software finance commerce etc......
google       USA           CA          1        0          0
jimmy        GBR           unknown     0        0          1
microsoft    USA           NY          1        0          0

I want to get the average number of each industry in each state for example: I could have that 14% of the industry in CA is in software, 15% of the industry in CA is healthcare etc...

Obviously I need to get the total number of companies across all industries in each state and divide the number of companies in each individual industry by this to get the percentage of each industry in each state.

I just can't figure out a functioning way to do this.

Obviously I have tried using something like this in different ways, but to no avail:

new_df = df['state_code'].value_counts(normalize=True)

I want to get the sum of all the columns software, finance, commerce etc... and then give the percentage of each column when compared to the other columns.

Expected output:

State_Code software finance commerce etc..... 
CA           20%      10%     5%       65%
NY           10%      20%     10%      60%
AH           5%       5%      20%      70%

2 Answers2

2

I believe need first aggregate sum and then divide by div sum of columns per rows:

print (df)
  company_name country_code state_code  software  finance  commerce
0       google          USA         CA         1        0         4
1        jimmy          GBR    unknown         5        6         1
2    microsoft          USA         NY         1        0         0


#convert all columns without first to floats or ints
cols = df.columns.difference(['company_name', 'country_code', 'state_code'])
df[cols] = df[cols].astype(float)
#if not possible use astype because some non numeric values 
#df[cols] = df[cols].apply(lambda x: pd.to_numeric(x, errors='coerce'))

a = df.groupby(['state_code']).sum()
df = a.div(a.sum(axis=1), axis=0)
print (df)
            software  finance  commerce
state_code                             
CA          0.200000      0.0  0.800000
NY          1.000000      0.0  0.000000
unknown     0.416667      0.5  0.083333

If you need also percentage, multiply by 100 and if necessesary add round and cast to integers:

df = a.div(a.sum(1), axis=0).mul(100).round(0).astype(int)
print (df)
            software  finance  commerce
state_code                             
CA                20        0        80
NY               100        0         0
unknown           42       50         8

Last add percentage, but then values are not numeric, so no possible process later:

df = a.div(a.sum(1), axis=0).mul(100).round(0).astype(int).astype(str).add('%')
print (df)
           software finance commerce
state_code                          
CA              20%      0%      80%
NY             100%      0%       0%
unknown         42%     50%       8%
rafaelc
  • 57,686
  • 15
  • 58
  • 82
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks for such an in-depth answer. Although, when I run this I just get 0% values for every column? –  Apr 11 '18 at 17:58
  • If check `print (df.dtypes)` get for last column `int` or `float`s? Or all values are `object`s, it means obviously numeric? – jezrael Apr 11 '18 at 18:00
  • also `print(df.groupby(['state_code']).sum())` working nice and get numeric values? – jezrael Apr 11 '18 at 18:03
  • All values are of type object. Still it is returning 0% for all columns. print(df.groupby(['state_code']).sum()) is working fine. –  Apr 11 '18 at 18:16
  • And this is reason. Gime me a sec. – jezrael Apr 11 '18 at 18:17
  • For some reason I am still getting 0% for every column? Thanks for your answer though, I will try figure it out further. –  Apr 12 '18 at 09:09
  • @Johnboy - Do you use `cols = df.columns.difference(['company_name', 'country_code', 'state_code']) df[cols] = df[cols].astype(float)` ? – jezrael Apr 12 '18 at 09:10
  • @Johnboy - And if it failed, do you try `cols = df.columns.difference(['company_name', 'country_code', 'state_code'])` `df[cols] = df[cols].apply(lambda x: pd.to_numeric(x, errors='coerce'))` Values in list `['company_name', 'country_code', 'state_code']` are all non numeric columns in your dataframe. – jezrael Apr 12 '18 at 09:27
  • I have tried both and commented out the other, but unfortunately neither worked. –  Apr 12 '18 at 09:33
0

The best way to do this is to put all the industry in an array. In my solution, I have called this testy.

First get the sum of all industries.

count = 0
for i in testy:
count += int(usa_df[i].sum())

Then divide this sum by the total of each industry and divide this by count and multiply by 100% This will then get you a percentage of each industry in the market.

for i in testy:
    tot = usa_df[i].sum()
    percent = (tot/count)*100
    print(i+" - "+str(percent)+"%"

The output will be as follows:

software - 20%
finance  - 30%
commerce - 10%
etc........ 
pandasman
  • 90
  • 18