1

Suppose I have the following data:

df = pd.DataFrame({
   'state':['CA', 'WA', 'CA', 'AZ','AZ','WA','WA','CA'],
   'gender':['M','F','M','F', 'F','F','M','M']
})

What I am trying to do is create a dataframe via aggregation that shows the 'M'and 'F' breakdown by state. Right now I am able to get the frequency of each category pair which turns out to be:

df.groupby(['state', 'gender']).size()


state  gender
AZ     F         2
CA     M         3
WA     F         2
       M         1

My first question with this problem is how can I get the previous table to also show the fact that CA has 0 Fs and AZ has 0 Ms. Ideally I would like to have something that appears like the following:

state  gender
AZ     F         2
       M         0
CA     M         3
       F         0
WA     F         2
       M         1

Finally, what I am attempting to do with this is create percentage breakdowns based on frequency and the full total in the state. That would look like the following:

state  gender
AZ     F         100.00
       M         0.00
CA     M         100.00
       F         0.00
WA     F         66.66
       M         33.33

This comes close to what I need but it doesn't handle frequency and instances where a certain category is zero. Can anyone help?

Community
  • 1
  • 1
theamateurdataanalyst
  • 2,794
  • 4
  • 38
  • 72

1 Answers1

0

Well, it is far from elegant, but it does the trick I guess. Starting with your definition of the dataframe, this works for me:

import pandas as pd
import itertools


df = pd.DataFrame({
       'state':['CA', 'WA', 'CA', 'AZ','AZ','WA','WA','CA'],
       'gender':['M','F','M','F', 'F','F','M','M']
})

new_df = df.groupby(['state', 'gender']) \
    .size() \
    .reindex(list(itertools.product(set(df['state']),
                                    set(df['gender']))),
             fill_value=0) \
    .groupby(level=0) \
    .apply(lambda x: 100*x/float(x.sum()))

If anyone has a (more) elegant solution, I'd welcome it wholeheartedly!

Nelewout
  • 6,281
  • 3
  • 29
  • 39