
I have a simple data frame with nationality, occupation, and age columns. The nationalities are encoded as integers: 0, 1, 2 for Europe, America, and Asia.

For each occupation, I would like to find the percentage of each nationality. For example: 67% of doctors are European, 33% are Asian.

import pandas as pd
import numpy as np

# Create the data frame: random nationality codes (0-2) and ages (24-69)
df = pd.DataFrame(np.concatenate(
    (np.random.randint(low=0, high=3, size=(10, 1)),
     np.random.randint(low=24, high=70, size=(10, 1))),
    axis=1))
df.columns = ['nationality', 'age']
df['occupation'] = ['teacher']*2 + ['engineer']*3 + ['doctor']*3 + ['lawyer']*2


  nationality   age occupation
0   0   65  teacher
1   0   31  teacher
2   0   30  engineer
3   2   63  engineer
4   0   28  engineer
5   1   27  doctor
6   0   52  doctor
7   0   60  doctor
8   0   33  lawyer
9   0   38  lawyer

df.groupby(['occupation','nationality']).count()

def iseuropean(x):
    if x==0:
        return 1
    else:
        return 0
def isamerican(x):
    if x==1:
        return 1
    else:
        return 0
def isasian(x):
    if x==2:
        return 1
    else:
        return 0

With groupby I can get the counts, but I would like to apply a function per occupation group that determines the percentage. I haven't been able to figure it out, though.

Any help would be greatly appreciated.

user3177938
  • https://stackoverflow.com/questions/23377108/pandas-percentage-of-total-with-groupby – BENY Nov 12 '17 at 17:50

1 Answer

I assume you're looking for the percentage of each nationality within each occupation:

In [11]: c = df.groupby(['occupation','nationality'])["age"].count().rename("count")

In [12]: c
Out[12]:
occupation  nationality
doctor      0              2
            1              1
engineer    0              2
            2              1
lawyer      0              2
teacher     0              2
Name: count, dtype: int64

In [13]: c / c.sum()  # proportion of each, maybe not very useful
Out[13]:
occupation  nationality
doctor      0              0.2
            1              0.1
engineer    0              0.2
            2              0.1
lawyer      0              0.2
teacher     0              0.2
Name: count, dtype: float64

In [14]: c / c.groupby(level=0).sum()  # proportion of each occupation
Out[14]:
occupation  nationality
doctor      0              0.666667
            1              0.333333
engineer    0              0.666667
            2              0.333333
lawyer      0              1.000000
teacher     0              1.000000
Name: count, dtype: float64
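
The per-occupation proportions above can also be reached in one step with `value_counts(normalize=True)` on the grouped column. This is a minimal sketch, not the only way to do it: the data is hard-coded from the frame printed in the question (the original used random values), and the name `pct` is just for illustration.

```python
import pandas as pd

# Assumption: values copied from the question's printed frame
# (0 = Europe, 1 = America, 2 = Asia).
df = pd.DataFrame({
    "nationality": [0, 0, 0, 2, 0, 1, 0, 0, 0, 0],
    "age": [65, 31, 30, 63, 28, 27, 52, 60, 33, 38],
    "occupation": ["teacher"] * 2 + ["engineer"] * 3
                  + ["doctor"] * 3 + ["lawyer"] * 2,
})

# Share of each nationality within each occupation, as a percentage.
pct = (
    df.groupby("occupation")["nationality"]
      .value_counts(normalize=True)  # proportions sum to 1 per occupation
      .mul(100)
      .round(1)
)
print(pct)
```

The result is indexed by (occupation, nationality), the same layout as `c / c.groupby(level=0).sum()`, but scaled to percentages.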

As an aside, you probably want to use a Categorical built from the codes rather than is_XXX helper functions:

In [21]: pd.Categorical.from_codes(df.nationality, ["european", "american", "asian"])
Out[21]:
[european, european, european, asian, european, american, european, european, european, european]
Categories (3, object): [european, american, asian]

In [22]: df.nationality = pd.Categorical.from_codes(df.nationality, ["european", "american", "asian"])

In [23]: df
Out[23]:
  nationality  age occupation
0    european   65    teacher
1    european   31    teacher
2    european   30   engineer
3       asian   63   engineer
4    european   28   engineer
5    american   27     doctor
6    european   52     doctor
7    european   60     doctor
8    european   33     lawyer
9    european   38     lawyer
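
Once the nationalities carry labels, a cross-tabulation is one way to show the percentages as an occupation-by-nationality table. This is a sketch under the same assumption as the printed frame above (hard-coded values rather than random ones); `labels` and `table` are illustrative names.

```python
import pandas as pd

# Assumption: values copied from the printed frame, codes 0/1/2.
df = pd.DataFrame({
    "nationality": [0, 0, 0, 2, 0, 1, 0, 0, 0, 0],
    "age": [65, 31, 30, 63, 28, 27, 52, 60, 33, 38],
    "occupation": ["teacher"] * 2 + ["engineer"] * 3
                  + ["doctor"] * 3 + ["lawyer"] * 2,
})

# Replace the integer codes with labelled categories.
labels = ["european", "american", "asian"]
df["nationality"] = pd.Categorical.from_codes(df["nationality"], labels)

# normalize="index" divides each row by its total, so every occupation
# row sums to 100 after scaling.
table = pd.crosstab(df["occupation"], df["nationality"], normalize="index") * 100
print(table.round(1))
```

Each row of `table` is one occupation and each column one nationality, which is often easier to read than the stacked Series when there are only a few categories.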
Andy Hayden
  • Thank you very much, Andy. That worked perfectly! And also, thank you very much for the note on Categorical. Really useful. Thanks again :-) – user3177938 Nov 12 '17 at 22:03