I have gone through several posts on this forum, but I cannot find an explanation for the behaviour I am seeing.

I have a CSV file whose header has many entries, with 300 points each. For each field (column of the CSV file) I would like to plot a histogram. The x-axis contains the elements of that column and the y-axis should show the number of samples that fall inside each bin. As I have 300 points, the number of samples in all bins added together should be 300, so the y-axis should go from 0 to, let's say, 50 (just an example). However, the values are gigantic (400e8), which makes no sense.

Sample of the table:

point | mydata
1 | 250.23e-9
2 | 250.123e-9
... | ...
300 | 251.34e-9

Please check my code, below. I am using pandas to open the csv and Matplotlib for the rest.

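import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab

df=pd.read_csv("/home/pcardoso/raw_data/myData.csv")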

# Figure parameters
figPath='/home/pcardoso/scripts/python/matplotlib/figures/'
figPrefix='hist_'           # Prefix to the name of the file.
figSuffix='_something'      # Suffix to the name of the file.
figString=''    # Full string passed as the figure name to be saved

precision=3
num_bins = 50

columns=list(df)

for fieldName in columns:

    vectorData=df[fieldName]
    
    # statistical data
    mu = np.mean(vectorData)  # mean of distribution
    sigma = np.std(vectorData)  # standard deviation of distribution

    # Create plot instance
    fig, ax = plt.subplots()

    # Histogram
    n, bins, patches = ax.hist(vectorData, num_bins, density='True',alpha=0.75,rwidth=0.9, label=fieldName)
    ax.legend()
    
    # Best-fit curve
    y=mlab.normpdf(bins, mu, sigma)
    ax.plot(bins, y, '--')
    
    # Setting axis names, grid and title
    ax.set_xlabel(fieldName)
    ax.set_ylabel('Number of points')
    ax.set_title(fieldName + r': $\mu=$' + eng_notation(mu,precision) + r', $\sigma=$' + eng_notation(sigma,precision))
    ax.grid(True, alpha=0.2)
    
    fig.tight_layout()      # Tweak spacing to prevent clipping of ylabel
    
    # Saving figure
    figString=figPrefix + fieldName +figSuffix
    fig.savefig(figPath + figString)

plt.show()

plt.close(fig)

In summary, I would like to know how to get the y-axis values right.

Edit: 6 July 2020

Plot of ibis in nA, with the density curve on top of the histogram

Edit 08 June 2020: I would like the density estimator to follow the plot like this:

enter image description here

desertnaut
  • Does this answer your question? [Is there a parameter in matplotlib/pandas to have the Y axis of a histogram as percentage?](https://stackoverflow.com/questions/17874063/is-there-a-parameter-in-matplotlib-pandas-to-have-the-y-axis-of-a-histogram-as-p) – Trenton McKinney Jul 04 '20 at 20:47
  • Thanks for your question. I had come across this post, already, and it doesn't do what I want. Thanks anyway. ;-) – Pedro Cardoso Jul 05 '20 at 13:27

1 Answer

Don't use density='True'. With that option, each bar shows the number of samples in the bin divided by the total number of samples and by the width of the bin, so that the area under the histogram is 1. If that width is small (as in your case, with x-values around 250e-9), the values become large.
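
To see where the huge values come from, here is a minimal sketch using only NumPy. The data are synthetic (300 values around 250e-9, an assumption standing in for one column of the CSV), and `np.histogram` is used because it applies the same normalisation as `ax.hist`: with `density=True` each height is `count / (N * bin_width)`.

```python
import numpy as np

# Synthetic stand-in for one CSV column: 300 values near 250e-9 (assumed data).
rng = np.random.default_rng(0)
x = rng.normal(250e-9, 0.5e-9, 300)

counts, edges = np.histogram(x, bins=50)               # plain counts per bin
density, _ = np.histogram(x, bins=50, density=True)    # normalised heights
widths = np.diff(edges)                                # bin widths, ~1e-10 here

print(counts.sum())               # total number of samples
print((density * widths).sum())   # area under the density histogram
print(density.max())              # very large, because the bins are so narrow
```

The counts always add up to the number of samples, whatever the x-scale is, while the density heights grow as the bins get narrower.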

Edit: Ok, to un-norm the normed curve, you need to multiply it by the number of points and the width of one bin. Here is a reduced example:

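from numpy.random import normal
from scipy.stats import norm
import pylab

N = 300        # number of samples
sigma = 10.0   # standard deviation of the test data
B = 30         # number of bins

def main():
    x = normal(0, sigma, N)

    # Plain counts per bin (no density normalisation)
    h, bins, _ = pylab.hist(x, bins=B, rwidth=0.8)
    bin_width = bins[1] - bins[0]

    # Evaluate the pdf at the bin centres and scale it by N * bin_width
    # to convert the density into expected counts per bin
    centres = bins[:-1] + bin_width / 2
    h_n = norm.pdf(centres, 0, sigma) * N * bin_width
    pylab.plot(centres, h_n)
    pylab.show()

if __name__ == "__main__":
    main()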
Dr. V
  • Thanks for your answer, removing density solved the problem. Strangely, setting density to 'False' doesn't do anything. But now, how can I plot a density curve on top of the histogram? With the code as I have it, the curve uses the same huge scale. How can I force both the histogram and the density plot to use the same scale? – Pedro Cardoso Jul 05 '20 at 13:26
  • Haha, that's a bug in itself: `density='True'` worked by coincidence, as the string `'True'` is not empty and casts to `True` as `boolean`, but so does `'False'` or `density='Bazinga'`. Try `density=False`. Well, density is by definition such that the area under the histogram is one. The only way to get density into similar scales is to normalise the `x`-axis, i.e. divide all `x`-values by the interval `max(x) - min(x)`. – Dr. V Jul 05 '20 at 18:12
  • Hi, I guess what I am expecting is more an envelope line, rather than a density plot. – Pedro Cardoso Jul 06 '20 at 08:39
  • As you can see in my post (I've edited it), I have small values on the x-axis (10^-9) and the y-axis is just the number of points. The number of points/samples on the y-axis should be fixed no matter the values on the x-axis. Can you, please, help me with this? – Pedro Cardoso Jul 06 '20 at 08:46
  • Now I edited my answer on how to un-norm the normed curve. – Dr. V Jul 06 '20 at 18:39
  • Thanks for your inputs. That's quite interesting, but this approach forces the code to always fit a Gaussian distribution. I would like to have something, which could follow the data, like the one on the plot that I've added to my post. Thanks, Pedro – Pedro Cardoso Jul 08 '20 at 09:19
  • The scaling method remains the same even if you have any other distribution. However, if you fit your own elephant to the data, you don't need to go via normalization at all. You can fit any curve to the histogram data `bins[:-1] -> h` and plot it as any other function. – Dr. V Jul 08 '20 at 15:09
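
If the curve should follow the data rather than a fitted Gaussian, one possible approach is a kernel density estimate scaled by the same `N * bin_width` factor. This is only a sketch and not something proposed in the thread: `scipy.stats.gaussian_kde` is swapped in as the estimator, and the data are synthetic, deliberately non-Gaussian values expressed in nA (an assumption standing in for one CSV column).

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

# Synthetic, deliberately bimodal stand-in for one column (values in nA).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(250.0, 0.3, 200), rng.normal(251.5, 0.2, 100)])

num_bins = 50
counts, edges = np.histogram(x, bins=num_bins)
bin_width = edges[1] - edges[0]

# Kernel density estimate: integrates to 1, like any probability density.
kde = gaussian_kde(x)
pad = 3 * x.std()                      # extend the grid past the data range
grid = np.linspace(x.min() - pad, x.max() + pad, 400)

# Convert density to expected counts per bin: multiply by N and the bin width.
curve = kde(grid) * x.size * bin_width

fig, ax = plt.subplots()
ax.hist(x, bins=edges, alpha=0.75, rwidth=0.9, label="counts")
ax.plot(grid, curve, "--", label="KDE scaled to counts")
ax.set_xlabel("value (nA)")
ax.set_ylabel("Number of points")
ax.legend()
```

Call `plt.show()` or `fig.savefig(...)` afterwards to view the result. The y-axis stays in counts because the histogram is drawn without `density`, and only the curve is rescaled.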