20

plt.hist's density argument does not work.

I tried to use the density argument in the plt.hist function to normalize stock returns in my plot, but it didn't work.

The following code worked fine for me and gave me the probability density function I wanted.

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(19680801)

# example data
mu = 100  # mean of distribution
sigma = 15  # standard deviation of distribution
x = mu + sigma * np.random.randn(437)

num_bins = 50

plt.hist(x, num_bins, density=1)

plt.show()

[Image: plot shows density]

But when I tried it with stock data, it simply didn't work: the result looked like unnormalized data. I didn't find any abnormal values in my data array.

import numpy as np
import matplotlib.pyplot as plt

# "returns" is a NumPy array consisting of 360 days of stock returns
fig = plt.figure()
plt.hist(returns, 50, density=True)
plt.show()

[Image: density not working]

tdy
riversxiao
  • What does your actual data look like? – gmds Apr 07 '19 at 03:56
  • Something like this: array([ 1.88179947e-02, -4.67532468e-03, 9.85850151e-03, 3.38807856e-03, 6.23819607e-03, 1.37640769e-02, -2.24416517e-03, -2.83400810e-02, -4.09722222e-02, -2.89645185e-03, -1.39191479e-02, 4.35743218e-03, 3.48304308e-03, -1.15698453e-02, 1.81123706e-02, 2.32361128e-02, 4.41750444e-02, 1.81231240e-03, 3.92334219e-02, 7.23494533e-03, 4.80665370e-03, 7.04111798e-03, 1.43040137e-02, -7.62997264e-03]) – riversxiao Apr 07 '19 at 03:58
  • I tried to convert the data type to float, but the result is still the same – riversxiao Apr 07 '19 at 03:59
  • What else do you expect the second graph to look like? – Sheldore Apr 07 '19 at 10:38
  • Both plots are correct in the sense that they are both normalized (= the area of the bars sums up to 1). Probably you just have a different idea of what you'd expect the `density` to be in mind? In that case I suppose this problem can only be solved if you tell people what that would be. – ImportanceOfBeingErnest Apr 07 '19 at 19:29
  • 1
    @ImportanceOfBeingErnest I assume that he expects to see the probability value for each bar on the vertical axis. In the bottom picture, you can see the value changes from 0 to 40. I suspect that he is expecting it to vary between 0 and 1. – Blade Jan 23 '20 at 17:54
  • I'm having the same problem, I'm expecting the values to vary between 0 and 1. Can someone explain in an answer what are the limits given by the Matplotlib graph? – MPA95 Apr 04 '20 at 23:08
  • Running into the same problem. The y-axis label should be the density of each bar. – Ethan Heilman Aug 28 '20 at 02:04
  • 2
    Does this answer your question? [pylab.hist(data, normed=1). Normalization seems to work incorrect](https://stackoverflow.com/questions/5498008/pylab-histdata-normed-1-normalization-seems-to-work-incorrect) – Arne Oct 26 '20 at 01:13

4 Answers

9

This is a known issue in Matplotlib.

As stated in Bug Report: The density flag in pyplot.hist() does not work correctly

When density = False, the histogram plot would have counts on the Y-axis. But when density = True, the Y-axis does not mean anything useful. I think a better implementation would plot the PDF as the histogram when density = True.

The developers view this as a feature, not a bug, since it maintains compatibility with NumPy. They have already closed several bug reports about it because it is working as intended. Adding to the confusion, the example on the Matplotlib site appears to show this feature working, with the y-axis given a meaningful value.

What you want to do with matplotlib is reasonable but matplotlib will not let you do it that way.
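To illustrate the NumPy compatibility point, here is a minimal sketch. The `returns` array is synthetic stand-in data (not the asker's), and the Agg backend is selected only so the snippet runs without a display. It compares the bar heights from `plt.hist(..., density=True)` with those from `np.histogram(..., density=True)`:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in for 360 daily stock returns
rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.02, size=360)

# Bar heights as drawn by Matplotlib
heights_mpl, edges_mpl, _ = plt.hist(returns, bins=50, density=True)

# Bar heights as computed by NumPy with the same settings
heights_np, edges_np = np.histogram(returns, bins=50, density=True)

# True when both libraries agree on the density values
same_as_numpy = np.allclose(heights_mpl, heights_np)
print(same_as_numpy)
```

If the two agree, the y-axis values are simply NumPy's density estimate: counts divided by (sample size × bin width), not per-bin probabilities.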

Ethan Heilman
2

It is not a bug. The total area of the bars equals 1. The numbers only seem strange because your bin widths are small, so the bar heights have to be large for height × width to sum to 1.
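A quick way to check this yourself is sketched below. It assumes synthetic returns with a standard deviation of 0.02 (a stand-in for the real data) and uses `np.histogram`, which accepts the same `bins` and `density` arguments:

```python
import numpy as np

# Hypothetical stand-in for 360 daily stock returns
rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.02, size=360)

heights, edges = np.histogram(returns, bins=50, density=True)
widths = np.diff(edges)  # width of each bin, far smaller than 1 here

area = np.sum(heights * widths)  # total area under the bars
max_height = heights.max()       # tallest bar, free to exceed 1

print(area, max_height)
```

The area comes out as 1 (up to floating-point error) even though individual bar heights are well above 1.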

0

Since this isn't resolved: @user14518925's response is actually correct. `density=True` treats the bin width as a real quantity, whereas, as I understand it, you want each bin to count as having a width of 1 so that the bar heights themselves sum to 1. More succinctly, what you're seeing is:

\sum_{i}y_{i}\times\text{bin size} =1

Whereas what you want is:

\sum_{i}y_{i} =1

Therefore, all you really need to change is the tick labels on the y-axis. One way to do this is to disable the density option:

density=False

and instead divide the tick values by the total sample size (shown here with your example):

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(19680801)

# example data
mu = 0  # mean of distribution
sigma = 0.0000625  # standard deviation of distribution
x = mu + sigma * np.random.randn(437)

fig = plt.figure()
plt.hist(x, 50, density=False)  # plot raw counts
locs, _ = plt.yticks()  # current tick positions, in counts
print(locs)
# relabel each tick as a fraction of the sample (count / sample size)
plt.yticks(locs, np.round(locs / len(x), 3))
plt.show()
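A variation on the same idea, offered as a sketch rather than a definitive fix: instead of relabeling the ticks afterwards, pass `weights` to `plt.hist` so that every sample contributes `1 / len(x)`. The bar heights are then the fraction of samples in each bin. The data below is synthetic, and the Agg backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data with the same scale as the example above
rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.0000625, size=437)

# Each sample counts as 1/N, so each bar is the fraction of samples in its bin
weights = np.full(len(x), 1.0 / len(x))
fractions, edges, _ = plt.hist(x, bins=50, weights=weights)

total = fractions.sum()  # the fractions across all bins add up to 1
print(total)
```

With this approach the y-axis runs between 0 and 1 regardless of how narrow the bins are, because the bin width never enters the calculation.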
tvbc
0

Another approach, besides that of tvbc, is to keep density=True and rescale the y-tick labels by the bin width. This assumes all bins have the same width.

import matplotlib.pyplot as plt
import numpy as np

steps = 10
bins = np.arange(0, 101, steps)
data = np.random.random(100000) * 100

plt.hist(data, bins=bins, density=True)
yticks = plt.gca().get_yticks()
plt.yticks(yticks, np.round(yticks * steps, 2))
plt.show()
Marco Wedemeyer