Stacked bar plots with some missing values and many indices

Question

I have a large DataFrame df (sorted by 'year'):

year       gender
1894       male
1895       male
1895       male
1896       male
1900       male
...
2008       male
2008       female
2009       male
2009       female
2009       female

and I want to make a stacked bar chart with 'year' on the x-axis and the number of occurrences of each year on the y-axis, with the count for df['gender'] == 'female' stacked on top of the count for df['gender'] == 'male' in each bar.

I tried the following:

import plotly.express as px

df['freq'] = df.groupby('year')['gender'].transform('count')

fig = px.bar(df, x="year", y="freq", color='gender')
fig.show()

However, this takes too long to run and returns a blank graph. So, instead of creating the stacked bar chart with plotly, I tried matplotlib:

import matplotlib.pyplot as plt

df_male = df[df['gender'] == 'male']
df_female = df[df['gender'] == 'female']

X = range(1894, 2010)

plt.bar(X, df_male['year'], color = 'b')
plt.bar(X, df_female['year'], color = 'r', bottom = df_male['year'])
plt.show()

But this raises ValueError: shape mismatch: objects cannot be broadcast to a single shape. I wonder whether this is because some years between 1894 and 2009 do not appear in df (e.g. 1897, 1898, 1899, etc.).

Any insights to help me go further would be appreciated.

@user_na that could be it. They are very uneven: `df_male` has a length of approximately 485,000, where `df_female` has about 45,000. But wouldn't there be a way to stack them nonetheless by their frequencies per year? — jstaxlin, May 13 '21 at 08:04
It seems that you are missing the step to create a histogram by year in the filtered arrays. You will need a histogram of the year column of those two. see https://stackoverflow.com/questions/13129618/histogram-values-of-a-pandas-series — user_na, May 13 '21 at 08:20

JohanC · Accepted Answer · 2021-05-13T09:44:19.690

The easiest solution would be seaborn 0.11's histplot:

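import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Random stand-in data with the same two columns as in the question
df = pd.DataFrame({'year': np.random.randint(1894, 2010, 200),
                   'gender': np.random.choice(['male', 'female'], 200)})
# discrete=True draws one bar per integer year; multiple='stack' stacks the hue levels
sns.histplot(data=df, x='year', hue='gender', discrete=True, multiple='stack')
plt.show()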

Another option would be to create the grouped dataframe as follows and then use pandas' plotting:

df.groupby(['year', 'gender']).size().unstack().plot.bar(stacked=True)

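Here df.groupby(['year', 'gender']).size() creates a series of counts with year and gender as a two-level index. unstack() moves the gender level into the columns, giving a dataframe with one row per year and one column per gender; year/gender combinations that never occur show up as NaN. The unstacked dataframe could also be sent to plotly. It looks like: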

gender  female  male
year                
1894       1.0   3.0
1895       1.0   4.0
1896       NaN   1.0
1897       NaN   2.0
....
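
If you would rather stay with matplotlib, as in the question, the same counts table can be passed to plt.bar directly. What follows is only a minimal sketch: the random df, the fixed seed and the 1894 to 2009 range are assumptions standing in for the real data. Reindexing to the full range of years and filling with 0 gives both bar calls arrays of the same length, which is what the shape mismatch error was complaining about:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in for the question's data (assumption, not the real df)
rng = np.random.default_rng(0)
df = pd.DataFrame({'year': rng.integers(1894, 2010, 200),
                   'gender': rng.choice(['male', 'female'], 200)})

# One row per year, one column per gender; absent combinations become 0
counts = df.groupby(['year', 'gender']).size().unstack(fill_value=0)

# Add the years that never occur and make sure both gender columns exist
years = range(1894, 2010)
counts = counts.reindex(index=years, columns=['male', 'female'], fill_value=0)

fig, ax = plt.subplots(figsize=(12, 4))
ax.bar(counts.index, counts['male'], color='b', label='male')
# bottom= starts each female bar where the male bar for that year ends
ax.bar(counts.index, counts['female'], color='r', label='female',
       bottom=counts['male'])
ax.set_xlabel('year')
ax.set_ylabel('count')
ax.legend()
# plt.show()  # uncomment to display the figure
```

With the real data, replace the random df with your own and adjust the range of years if needed.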
