
I tried looking this up on other users' questions, but I don't think I have found an answer.

I am trying to plot a histogram from data stored in a Pandas dataframe, and I want the y-axis value of each bin to equal the probability of a value falling in that bin. The density=True argument of matplotlib.pyplot.hist divides the count in each bin by the total count and by the bin width, so for bins whose width is not 1 the bar height does not equal the probability of landing in that bin. It instead equals the probability per unit of x within that bin. I want my bins to be 10 units wide, which has led to my issue.
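To make that normalization concrete, here is a minimal sketch on toy data (the ten values and the three 10-unit bins are made up for illustration and are not my real data). It uses np.histogram, which plt.hist relies on for binning, and compares the density=True heights with the plain per-bin fractions:

```python
import numpy as np

# Toy data (an assumption for illustration): 8 zeros and 2 larger values.
values = np.array([0, 0, 0, 0, 0, 0, 0, 0, 15, 25])

# Three bins, each 10 units wide, covering [0, 30].
counts, edges = np.histogram(values, bins=3, range=(0.0, 30.0))
widths = np.diff(edges)

# density=True divides each count by (total count * bin width).
density, _ = np.histogram(values, bins=3, range=(0.0, 30.0), density=True)

# The quantity I actually want: the fraction of samples in each bin.
probability = counts / counts.sum()

print(probability)       # fraction of samples per bin
print(density)           # fraction of samples per bin, per unit of x
print(density * widths)  # multiplying by the bin width recovers the fraction
```

With 10-unit bins, every density value is one tenth of the corresponding probability, which matches what I see in my plot.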

My code to generate a Pandas dataframe with data similar to what I'm working with:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from random import seed, randint

# 49,500 zeros followed by 500 random integers in [1, 500] (50,000 values total).
# Building the list first avoids appending to the DataFrame one row at a time,
# which is very slow for this many rows.
seed(1)
values = [0] * 49500 + [randint(1, 500) for _ in range(500)]
data = pd.DataFrame({'Col1': values})

My code to plot the histogram:

# plt.subplots() already creates the figure, so no separate plt.figure() call is needed.
fig2, ax2 = plt.subplots()
# density=True divides each bin count by (total count * bin width).
ax2.hist(data['Col1'], range=(0.0, 500.0), bins=50, label='50000 numbers\n in 10 unit bins', density=True)
plt.title('Probability Density of Some Numbers from 0 to 500', wrap=True)
plt.legend(loc='upper right')
plt.yscale('log')
plt.xticks()
plt.minorticks_on()
plt.ylabel('Probability')
plt.xlabel('Number')
plt.savefig('randnum.png')

My histogram (note that the 0-10 bin, which holds roughly 99% of the data, shows a value of only about 0.1):

[Figure: Histogram with plt]

I do realize that if the y-axis value is a probability that is not divided by the bin width, the integral of the histogram no longer equals 1 (it will equal 10 in my case), but this is precisely what I am seeking.

Is there a way to either 1) change the value the histogram is normalized to or 2) directly multiply y-values of the histogram by a value of my choosing?

  • You seem to be describing seaborn's `sns.histplot(..., stat='probability')`. See [docs](https://seaborn.pydata.org/generated/seaborn.histplot.html). Note that StackOverflow's guidelines require reproducible code and test data. Seaborn is a high-level interface for creating statistical plots, based on matplotlib and pandas. – JohanC Jul 05 '22 at 18:56
  • @JohanC Please see my edits. I will try your seaborn solution and get back to you. – cameronpoe Jul 06 '22 at 15:21
  • @JohanC was able to get it working with plt, edited my original question to reflect that and your help! – cameronpoe Jul 06 '22 at 16:23
  • If you want to answer your own question, you should add that as a new answer. You can accept your answer (but not up-vote it). The post edits are meant to clarify the question, not to add the answer. – JohanC Jul 06 '22 at 19:42

1 Answer


I was able to accomplish this in pyplot with help from @JohanC's reference to Seaborn. The terminology I was looking for is 'probability mass' (the histogram bar heights sum to 1). Using [this answer][2], I was able to properly plot my histogram. Below is my code and my new histogram:

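# plt.subplots() already creates the figure, so no separate plt.figure() call is needed.
fig2, ax2 = plt.subplots()
# Give every sample a weight of 1/N so that the bar heights sum to 1.
weights = np.ones_like(data['Col1']) / len(data['Col1'])
ax2.hist(data['Col1'], range=(0.0, 500.0), weights=weights, bins=50, label='50000 numbers\n in 10 unit bins')
# The bars now show probability mass rather than density, so the title says so.
plt.title('Probability Mass of Some Numbers from 0 to 500', wrap=True)
plt.legend(loc='upper right')
plt.yscale('log')
plt.xticks()
plt.minorticks_on()
plt.ylabel('Probability')
plt.xlabel('Number')
plt.savefig('randnum.png')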

[Figure: the new histogram]
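As a quick numerical check, here is a sketch using np.histogram with the same binning as the plot. The data here are a stand-in generated with NumPy (an assumption for illustration, so the individual values differ from my dataframe), but the structure is the same: 49,500 zeros plus 500 integers between 1 and 500. It also covers option 2 from the question, since multiplying the density=True heights by the bin width gives the same numbers as the weights approach:

```python
import numpy as np

# Stand-in data (assumption): same shape as the question's data, different random values.
rng = np.random.default_rng(1)
values = np.concatenate([np.zeros(49500), rng.integers(1, 501, size=500)])

# Option 1: weight every sample by 1/N, as in the plotting code above.
weights = np.ones_like(values) / len(values)
mass, edges = np.histogram(values, bins=50, range=(0.0, 500.0), weights=weights)
widths = np.diff(edges)

# Option 2: take the density=True heights and multiply them by the bin width.
density, _ = np.histogram(values, bins=50, range=(0.0, 500.0), density=True)
scaled = density * widths

print(mass.sum())             # total probability across all bins
print(mass[0])                # probability of the 0-10 bin
print((mass * widths).sum())  # area under the bars
print(np.allclose(mass, scaled))
```

If you go with option 2, the scaled heights can be drawn with ax2.bar(edges[:-1], scaled, width=widths, align='edge') instead of ax2.hist.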
