
I wrote a (newbie) Python function (below) that draws a bar chart broken out by a primary and, optionally, a secondary dimension. For example, the image below charts the percentage of people of each gender who have attained a given level of education.

Question: how do I overlay on each bar the median household size for that subgroup, e.g. place a point signifying the value '3' on the College/Female bar? None of the examples I have seen overlay the point accurately on the correct bar.

I'm extremely new to this, so thank you very much for your help!

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'Student'       : ['Alice', 'Bob', 'Chris',  'Dave',    'Edna',    'Frank'],
                   'Education'     : ['HS',    'HS',  'HS',     'College', 'College', 'HS'   ],
                   'Household Size': [4,        4,     3,        3,         3,         6     ],
                   'Gender'        : ['F',     'M',   'M',      'M',       'F',       'M'    ]})


def MakePercentageFrequencyTable(dataFrame, primaryDimension, secondaryDimension=None, extraAggregatedField=None):
    lod = dataFrame.groupby([secondaryDimension]) if secondaryDimension is not None else dataFrame

    primaryDimensionPercent = lod[primaryDimension].value_counts(normalize=True) \
                         .rename('percentage') \
                         .mul(100) \
                         .reset_index(drop=False)

    if secondaryDimension is not None:
        primaryDimensionPercent = primaryDimensionPercent.sort_values(secondaryDimension)
        g = sns.catplot(x="percentage", y=secondaryDimension, hue=primaryDimension, kind='bar', data=primaryDimensionPercent)
    else:
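        # Older pandas names the reset column 'index'; newer versions name it after the Series.
        primaryDimensionPercent = primaryDimensionPercent.rename(columns={'index': primaryDimension})
        sns.catplot(x="percentage", y=primaryDimension, kind='bar', data=primaryDimensionPercent)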
        
MakePercentageFrequencyTable(dataFrame=df,primaryDimension='Education', secondaryDimension='Gender')

# Question: I want to send in extraAggregatedField='Household Size' when I call the function such that 
# it creates a secondary 'Household Size' axis at the top of the figure
# and aggregates/integrates the 'Household Size' column such that the following points are plotted
# against the secondary axis and positioned over the given bars:
#
# Female/College => 3
# Female/High School => 4
# Male/College => 3
# Male/High School => 4

[Image: what I have been able to achieve so far]

dnb
  • Welcome to Stack Overflow! Please take a moment to read [How do I ask a good question?](https://stackoverflow.com/help/how-to-ask). You need to provide a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) that includes a toy dataset (refer to [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples)) – Diziet Asahi Sep 18 '20 at 22:04
  • Thank you @DizietAsahi. I rewrote the question to be locally reproducible and hopefully clearer. – dnb Sep 18 '20 at 23:19

1 Answer


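You will have to use the axes-level functions sns.barplot() and sns.stripplot() rather than catplot(). catplot() is figure-level: it creates a new figure and a FacetGrid on every call, so the two plots cannot be layered. The axes-level functions accept an ax argument, so the bars can go on one Axes and the points on a twin Axes (ax.twiny()) that shares the same y-axis.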

Something like this:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'Student'       : ['Alice', 'Bob', 'Chris',  'Dave',    'Edna',    'Frank'],
                   'Education'     : ['HS',    'HS',  'HS',     'College', 'College', 'HS'   ],
                   'Household Size': [4,        4,     3,        3,         3,         6     ],
                   'Gender'        : ['F',     'M',   'M',      'M',       'F',       'M'    ]})


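def MakePercentageFrequencyTable(dataFrame, primaryDimension, secondaryDimension=None, extraAggregatedField=None, ax=None):
    # Draw on the supplied Axes, or on the current one if none is given.
    ax = plt.gca() if ax is None else ax
    lod = dataFrame.groupby([secondaryDimension]) if secondaryDimension is not None else dataFrame

    primaryDimensionPercent = lod[primaryDimension].value_counts(normalize=True) \
                         .rename('percentage') \
                         .mul(100) \
                         .reset_index(drop=False)

    if secondaryDimension is not None:
        # Fix the category orders so the bars and the points are dodged identically.
        order = sorted(dataFrame[secondaryDimension].unique())
        hue_order = sorted(dataFrame[primaryDimension].unique())
        ax = sns.barplot(x="percentage", y=secondaryDimension, hue=primaryDimension, data=primaryDimensionPercent,
                         order=order, hue_order=hue_order, ax=ax)
    else:
        # Older pandas names the reset column 'index'; newer versions name it after the Series.
        primaryDimensionPercent = primaryDimensionPercent.rename(columns={'index': primaryDimension})
        ax = sns.barplot(x="percentage", y=primaryDimension, data=primaryDimensionPercent, ax=ax)

    # The overlay needs both dimensions to know which bar each point belongs to.
    if extraAggregatedField is not None and secondaryDimension is not None:
        # Second x-axis along the top, sharing the categorical y-axis with the bars.
        ax2 = ax.twiny()
        # Aggregate only the requested column; the question asks for the median.
        extraDimension = dataFrame.groupby([primaryDimension, secondaryDimension])[extraAggregatedField] \
                                  .median().reset_index(drop=False)
        # jitter=False keeps each point centred on its bar.
        ax2 = sns.stripplot(data=extraDimension, x=extraAggregatedField, y=secondaryDimension, hue=primaryDimension,
                            order=order, hue_order=hue_order, jitter=False,
                            ax=ax2, dodge=True, edgecolor='k', linewidth=1, size=10)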


plt.figure()
MakePercentageFrequencyTable(dataFrame=df,primaryDimension='Education', secondaryDimension='Gender', extraAggregatedField='Household Size')
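
Before plotting, it can help to check which values the overlay should show. The following is only a sketch using the toy data from the question; the names `grouped` and `summary` are made up for illustration. It computes both the median (what the question asks for) and the mean of 'Household Size' per Gender/Education subgroup, selecting the numeric column first so the text column 'Student' never reaches the aggregation.

```python
import pandas as pd

df = pd.DataFrame({
    'Student':        ['Alice', 'Bob', 'Chris', 'Dave', 'Edna', 'Frank'],
    'Education':      ['HS', 'HS', 'HS', 'College', 'College', 'HS'],
    'Household Size': [4, 4, 3, 3, 3, 6],
    'Gender':         ['F', 'M', 'M', 'M', 'F', 'M'],
})

# Select the numeric column before aggregating, so the non-numeric
# 'Student' column is not part of the computation.
grouped = df.groupby(['Gender', 'Education'])['Household Size']

# One row per (Gender, Education) pair, with both statistics side by side.
summary = grouped.agg(['median', 'mean']).reset_index()
print(summary)
```

For this data the two statistics differ only for the Male/HS subgroup (household sizes 4, 3 and 6), so the choice of aggregation decides where that one point lands on the top axis.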


Diziet Asahi