
Below is the data whose PDF I want to plot: https://gist.github.com/ecenm/cbbdcea724e199dc60fe4a38b7791eb8#file-64_general-out

Below is the script:

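import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('64_general.out')
# density=True scales the bin heights so that the bar areas sum to 1;
# the legacy normed argument is dropped because it is redundant here.
H, X1 = np.histogram(data, bins=10, density=True)  # Is this the right way to get the PDF?
centers = (X1[:-1] + X1[1:]) / 2  # midpoint of each bin

plt.plot(centers, H)
plt.xlabel('Latency')
plt.ylabel('PDF')
plt.title('PDF of latency values')
plt.show()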

When I plot the above, I get the following.

  1. Is the above the correct way to calculate the PDF of a range of values?
  2. Is there any other way to confirm that the result I get is the actual PDF? For example, how can I show that the area under the PDF equals 1 in my case?

[Image: the resulting plot, titled 'PDF of latency values']

user2532296
  • Your data is made with integers only. Is this a discrete or continuous variable? Also take into account that PDF is "Probability Density Function". This means that for sparse data you are interpreting a "PDF" from it, not obtaining one. So, depending on your data, having 100 bins will beat 10 in terms of approximation (this is an example, don't take the numbers literally). – armatita Jun 22 '16 at 13:37
  • Thanks for the info. My data is a discrete variable. I didn't understand your last sentence. Could you explain more? – user2532296 Jun 22 '16 at 13:49
  • If it is a discrete variable then you are probably looking for a [PMF](https://en.wikipedia.org/wiki/Probability_mass_function) and my last comment won't apply. You can still use the histogram function to do it, but you need to make sure that each bin corresponds to a unique value. See if the answer in this [question](http://stackoverflow.com/questions/30889444/python-matplotlib-probability-mass-function-as-histogram) helps you. – armatita Jun 22 '16 at 14:17

1 Answer

  1. It is a legitimate way of approximating the PDF. Because np.histogram with bins = 10 splits the data range into 10 equal-width bins, several distinct values can land in the same bin, so you won't get the exact frequency of each number in your input. For a more exact result, count the occurrences of each number and divide by the total count. Also, since these are discrete values, plotting them as points or bars gives a more accurate impression than a connected line.

  2. In the discrete case, the relative frequencies should sum to 1. In the continuous case you can, for example, use np.trapz() to approximate the integral.
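
Both checks can be sketched roughly as follows. This is only an illustration: the array `data` is synthetic integer data generated as a stand-in for the contents of 64_general.out, and the names `pmf`, `pmf_total` and `area` are made up for the example. For the histogram, the area is computed as the sum of height times bin width, which is the quantity that density=True normalises to 1.

```python
import numpy as np

# Assumption: synthetic integer latencies standing in for
# np.loadtxt('64_general.out'); swap in the real file when available.
rng = np.random.default_rng(0)
data = rng.integers(low=20, high=60, size=1000)

# Discrete case: empirical PMF = occurrences of each distinct value
# divided by the total number of samples.
values, counts = np.unique(data, return_counts=True)
pmf = counts / counts.sum()
pmf_total = pmf.sum()  # should be 1 up to floating-point rounding

# Histogram case: with density=True the bar areas (height * width)
# sum to 1, regardless of the number of bins.
density, edges = np.histogram(data, bins=10, density=True)
area = np.sum(density * np.diff(edges))

print("sum of PMF:", pmf_total)
print("area under histogram:", area)
```

To plot the discrete version, something like plt.bar(values, pmf) draws one bar per distinct value instead of a connected line.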

user1337