1

I am getting a mongodb query (not sure whether that bit is relevant) and trying to plot a stacked histogram based the values of one of the columns.

cursor = db.gadgets.find()
df = pd.DataFrame(cursor)
df['created'].hist(by=df['gadgetTypeId'], sharex=True, sharey=True, figsize=(16,10), legend=True)

So far so good, but it creates a bunch of subplots. I want one stacked subplot.

Tried to get this using matplotlib.pyplot.hist(), but couldn't find how to use the by argument.

Irina Rapoport
  • 1,404
  • 1
  • 20
  • 37
  • Does https://stackoverflow.com/questions/24594511/creating-stacked-histogram-with-pandas-dataframes-data-python answer the question? Or https://stackoverflow.com/questions/58414786/stacked-histogram-by-decade-from-dataframe? I'm not sure I understood the question, but those are the first two results I get if I try putting `stacked histogram from dataframe` [into a search engine](https://duckduckgo.com/?q=stacked+histogram+from+dataframe). – Karl Knechtel Sep 01 '22 at 04:45
  • Both are sort of half-way answers for me. The first uses multiple DataFrames as an input. I know I could split mine into one DataFrame per gadgetId, but my hunch is, there has to be an easier way. The second has pivot tables, and yes it's a solution, but to me, it's more complex than the first one. Because then I'd lose data from the original table, would have to add it back, the whole thing is a bit cumbersome. I am thinking there has to be an easy way, because people combine charts all the time, it's an incredibly common operation. – Irina Rapoport Sep 01 '22 at 04:52

1 Answers1

0

With the following toy dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        "gadgetTypeId": [1, 2, 3, 3, 1, 2, 2, 1, 2, 1],
        "created": [2017, 2018, 2018, 2019, 2017, 2018, 2017, 2017, 2019, 2019],
    }
)

Here is one way to do it:

import matplotlib.pyplot as plt

plt.hist(
    x=[df.loc[df["gadgetTypeId"] == i, "created"] for i in df["gadgetTypeId"].unique()],
    stacked=True,
    color=["r", "g", "b"],
    label=df["gadgetTypeId"].unique(),
)
plt.xticks(ticks=df["created"])
plt.xlabel("created")
plt.yticks(ticks=range(df["created"].value_counts().max() + 1))
plt.ylabel("count")
plt.legend()

Which outputs:

enter image description here

You could also consider using Seaborn, which is less verbose while better looking:

import seaborn as sns

ax = sns.histplot(
    data=df,
    stat="count",
    multiple="stack",
    x="created",
    kde=False,
    palette="tab10",
    hue="gadgetTypeId",
    element="bars",
    legend=True,
)

ax.set_xticks(ticks=df["created"])
ax.set_yticks(ticks=range(df["created"].value_counts().max() + 1))

Which gets you:

enter image description here

Laurent
  • 12,287
  • 7
  • 21
  • 37