
I have a pandas DataFrame containing the percentage of students who have a certain skill level in each subject, broken down by gender:

import pandas as pd

iterables = [['Above basic','Basic','Low'], ['Female','Male']]
index = pd.MultiIndex.from_product(iterables, names=["Skills", "Gender"])
df = pd.DataFrame(data=[[36,36,8,8,6,6],[46,46,2,3,1,2],[24,26,10,11,16,13]], index=["Math", "Literature", "Physics"], columns=index)
print(df)

Skills     Above basic        Basic         Low
Gender          Female Male  Female Male  Female Male
Math                36   36       8    8       6    6
Literature          46   46       2    3       1    2
Physics             24   26      10   11      16   13

Next, I want to see how the skills are distributed across the subjects:

#plot how the skills are distributed according to the subjects
df.sum(axis=1,level=[0]).plot(kind='bar')
df.plot(kind='bar')


Now I would like to add the percentage of Male and Female to each bar in a stacked manner. For example, the first bar ("Math", "Above basic") should be 50/50, the bar ("Literature", "Basic") should be 40/60, the bar ("Literature", "Low") should be 33.3/66.7, and so on.

Could you give me a hand?

Trenton McKinney
gabboshow
  • Maybe https://stackoverflow.com/questions/59922701/how-can-i-group-a-stacked-bar-chart ? – JohanC Feb 14 '23 at 19:05
  • Using the level keyword in DataFrame and Series aggregations, as in `df.sum(axis=1,level=[0])`, is deprecated. Use `df.groupby(level=0, axis=1).sum()` instead. – Trenton McKinney Feb 14 '23 at 19:50

1 Answer

  • Using the level keyword in DataFrame and Series aggregations, df.sum(axis=1,level=[0]), is deprecated.
    • Use df.groupby(level=0, axis=1).sum()
  • df.div(dfg).mul(100).round(1).astype(str) creates a DataFrame of strings with the 'Female' and 'Male' percent for each of the 'Skills', which can be used to create a custom bar label.
  • As shown in this answer, use matplotlib.pyplot.bar_label to annotate the bars, which has a labels= parameter for custom labels.
  • Tested in python 3.11, pandas 1.5.3, matplotlib 3.7.0, seaborn 0.12.2
# group df to create the bar plot
dfg = df.groupby(level=0, axis=1).sum()

# calculate the Female / Male percent for each Skill
percent_s = df.div(dfg).mul(100).round(1).astype(str)

# plot the bars
ax = dfg.plot(kind='bar', figsize=(10, 7), rot=0, width=0.9, ylabel='Total Percent\n(Female/Male split)')

# iterate through the bar containers
for c in ax.containers:
    # get the Skill label
    label = c.get_label()
    # use the Skill label to select the matching column group, join the Female/Male strings with '/', and get an array of custom labels
    labels = percent_s.loc[:, percent_s.columns.get_level_values(0).isin([label])].agg('/'.join, axis=1).values
    # add the custom labels to the center of the bars
    ax.bar_label(c, labels=labels, label_type='center')
    # add total percent to the top of the bars
    ax.bar_label(c, weight='bold', fmt='%g%%')


percent_s

Skills     Above basic        Basic          Low      
Gender          Female  Male Female  Male Female  Male
Math              50.0  50.0   50.0  50.0   50.0  50.0
Literature        50.0  50.0   40.0  60.0   33.3  66.7
Physics           48.0  52.0   47.6  52.4   55.2  44.8
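
Note that the answer was tested with pandas 1.5.3, and DataFrame.groupby(..., axis=1) is itself deprecated as of pandas 2.1. The following is a minimal sketch of one way to get the same numbers without axis=1, by grouping the transposed frame instead. It is an alternative under that assumption, not the code used for the plot above. The names totals and labels_by_skill are made up for illustration, and the labels are built by plain string concatenation rather than the isin/agg expression.

```python
import pandas as pd

# rebuild the DataFrame from the question
iterables = [['Above basic', 'Basic', 'Low'], ['Female', 'Male']]
columns = pd.MultiIndex.from_product(iterables, names=['Skills', 'Gender'])
df = pd.DataFrame(
    data=[[36, 36, 8, 8, 6, 6], [46, 46, 2, 3, 1, 2], [24, 26, 10, 11, 16, 13]],
    index=['Math', 'Literature', 'Physics'],
    columns=columns,
)

# bar heights: sum Female + Male per Skill by grouping the transposed frame
dfg = df.T.groupby(level='Skills').sum().T

# per-Skill totals repeated under every (Skills, Gender) column,
# so the division below needs no alignment across index levels
totals = df.T.groupby(level='Skills').transform('sum').T

# Female / Male share of each Skill total, in percent
percent = df.div(totals).mul(100).round(1)

# one "Female/Male" label per subject for each Skill (hypothetical helper dict)
percent_s = percent.astype(str)
labels_by_skill = {
    skill: (percent_s[skill]['Female'] + '/' + percent_s[skill]['Male']).tolist()
    for skill in dfg.columns
}

print(dfg)
print(labels_by_skill['Low'])
```

With dfg and the labels computed this way, the dfg.plot(...) call and the ax.bar_label(c, labels=...) loop should work as before, passing labels_by_skill[c.get_label()] as the labels.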

import seaborn as sns

# melt df into a long form
dfm = df.melt(ignore_index=False).reset_index(names='Subject')

# plot the melted dataframe
g = sns.catplot(kind='bar', data=dfm, x='Subject', y='value', col='Gender', hue='Skills')

# Flatten the axes for ease of use
axes = g.axes.ravel()

# relabel the yaxis
axes[0].set_ylabel('Percent')

# add bar labels
for ax in axes:
    for c in ax.containers:
        ax.bar_label(c, fmt='%0.1f%%')


  • Or swap x= and col= to col='Subject' and x='Gender'.
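
The swapped layout can be sketched as follows. This is a self-contained version that rebuilds the question's df and assumes pandas 1.5 or later (for reset_index(names=...)) and a seaborn version that provides catplot. The Agg backend line is an assumption added only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend so no display is needed

import pandas as pd
import seaborn as sns

# rebuild the DataFrame from the question
iterables = [['Above basic', 'Basic', 'Low'], ['Female', 'Male']]
columns = pd.MultiIndex.from_product(iterables, names=['Skills', 'Gender'])
df = pd.DataFrame(
    data=[[36, 36, 8, 8, 6, 6], [46, 46, 2, 3, 1, 2], [24, 26, 10, 11, 16, 13]],
    index=['Math', 'Literature', 'Physics'],
    columns=columns,
)

# long form: one row per (Subject, Skills, Gender) combination
dfm = df.melt(ignore_index=False).reset_index(names='Subject')

# one facet per Subject, Gender on the x-axis, Skills as the hue
g = sns.catplot(kind='bar', data=dfm, x='Gender', y='value', col='Subject', hue='Skills')

# relabel the shared y-axis and annotate every bar
axes = g.axes.ravel()
axes[0].set_ylabel('Percent')
for ax in axes:
    for c in ax.containers:
        ax.bar_label(c, fmt='%0.1f%%')
```

This gives three facets (one per subject), each with the Female and Male groups side by side.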


Trenton McKinney