I have a simple data frame with columns for nationality, occupation, and age. The nationalities are encoded as the integers 0, 1, 2 for Europe, America, and Asia.
For each occupation, I would like to find the percentage of each nationality. For example: 67% of doctors are European, 33% are Asian.
import pandas as pd
import numpy as np
# Create the dataframe: random nationality codes (0-2) and random ages (24-69)
nationality = np.random.randint(low=0, high=3, size=(10, 1))
age = np.random.randint(low=24, high=70, size=(10, 1))
df = pd.DataFrame(np.concatenate((nationality, age), axis=1), columns=['nationality', 'age'])
df['occupation'] = ['teacher']*2 + ['engineer']*3 + ['doctor']*3 + ['lawyer']*2
   nationality  age occupation
0            0   65    teacher
1            0   31    teacher
2            0   30   engineer
3            2   63   engineer
4            0   28   engineer
5            1   27     doctor
6            0   52     doctor
7            0   60     doctor
8            0   33     lawyer
9            0   38     lawyer
df.groupby(['occupation','nationality']).count()
def iseuropean(x):
    if x==0:
        return 1
    else:
        return 0
def isamerican(x):
    if x==1:
        return 1
    else:
        return 0
def isasian(x):
    if x==2:
        return 1
    else:
        return 0
With groupby I can get the counts, but I would like to apply a function per occupation group that determines the percentage. I haven't been able to figure it out, though.
Any help would be greatly appreciated.
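
For reference, here is a minimal sketch of one way the per-occupation percentages might be computed. It assumes the 0/1/2 codes map to Europe/America/Asia as described above, and it hard-codes the ten sample rows from the printed frame so the result is reproducible. The names `labels` and `pct` are illustrative only. It uses `pd.crosstab` with `normalize="index"` so that each occupation row sums to 1 before scaling to percent.

```python
import pandas as pd

# Sample rows copied from the printed frame (fixed values instead of random ones).
df = pd.DataFrame({
    "nationality": [0, 0, 0, 2, 0, 1, 0, 0, 0, 0],
    "age": [65, 31, 30, 63, 28, 27, 52, 60, 33, 38],
    "occupation": ["teacher"] * 2 + ["engineer"] * 3 + ["doctor"] * 3 + ["lawyer"] * 2,
})

# Assumed mapping of the integer codes, taken from the description in the question.
labels = {0: "european", 1: "american", 2: "asian"}

# Cross-tabulate occupation against nationality; normalize="index" divides each
# row by its own total, so every occupation sums to 1. Multiply by 100 for percent.
pct = pd.crosstab(
    df["occupation"],
    df["nationality"].map(labels),
    normalize="index",
) * 100

print(pct.round(1))
```

With these sample rows, doctors come out as roughly 66.7% european and 33.3% american. If a long format is preferred over a table, `df.groupby("occupation")["nationality"].value_counts(normalize=True)` should give the same fractions as a Series indexed by occupation and nationality.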