-3

I have a set of raw data and I need to identify its distribution. What is the easiest way to plot a probability distribution function? I have tried fitting a normal distribution to it.

But I am more curious to know which distribution does the data carry within itself ?

I have no code to show my progress, as I have failed to find any function in Python that would let me test the distribution of the dataset. I do not want to slice the data and force it to fit a normal or skewed distribution.

Is there any way to determine the distribution of the dataset? Any suggestions are appreciated.

Is this a correct approach? Example
This is close to what I am looking for, but again it fits the data to a normal distribution. Example

EDIT:

The input has a million rows; a short sample is given below.

Hashtag,Frequency
#Car,45
#photo,4
#movie,6
#life,1

The frequency ranges from 1 to 20,000, and I am trying to identify the distribution of the keyword frequencies. I tried plotting a simple histogram, but the output is a single bar.

Code:

import pandas
import matplotlib.pyplot as plt


df = pandas.read_csv('Paris_random_hash.csv', sep=',')
plt.hist(df['Frequency'])
plt.show()

Output: (image: output of frequency count)

Sitz Blogz
  • Very first step: plot a histogram, and look at it :) – cel Mar 22 '16 at 10:25
  • @cel Thank you, this is what I was looking for. My next question: do I sort the data, as we do when plotting a CDF or CCDF? – Sitz Blogz Mar 22 '16 at 13:43
  • The histogram does not do what you think it does; you are trying to show a bar graph. The histogram needs each data point separately in a list, not the frequency itself. You have [3,2,0,4,...] but should have [1,1,1,2,2,4,4,4,4]. You cannot determine a probability distribution automatically: http://stats.stackexchange.com/questions/10517/identify-probability-distributions – Kobbe Mar 27 '16 at 14:49
  • [Here are all the `scipy.stats` distributions PDFs with example code.](http://stackoverflow.com/a/37559471/2087463) – tmthydvnprt Jun 01 '16 at 12:32

7 Answers

6

This is a minimal working example for showing a histogram. It only solves part of your question, but it can be a step towards your goal. Note that np.histogram returns the bin edges (one more value than there are bins), so you have to average adjacent edges to get the center of each bin.

import numpy as np
import matplotlib.pyplot as pl

x = np.random.randn(10000)

nbins = 20

# density=True normalizes the counts so the histogram integrates to 1
n, bins = np.histogram(x, nbins, density=True)
# bins holds the nbins + 1 bin edges; average neighbours to get the bin centers
pdfx = 0.5 * (bins[:-1] + bins[1:])
pdfy = n

pl.plot(pdfx, pdfy)
pl.show()

You can fit your data using the example shown in:

Fitting empirical distribution to theoretical ones with Scipy (Python)?
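
As a rough sketch of what such a fit could look like (not the code from the linked answer), the snippet below fits a few `scipy.stats` families by maximum likelihood and ranks them by their Kolmogorov-Smirnov distance to the data. The synthetic `data` array and the three candidate families are assumptions chosen only for illustration; substitute your own values and candidates.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Assumption: synthetic, positive, right-skewed data standing in for real values.
data = rng.lognormal(mean=1.0, sigma=0.8, size=2000)

# Assumption: an arbitrary, illustrative set of candidate families.
candidates = {"norm": stats.norm, "expon": stats.expon, "lognorm": stats.lognorm}

results = {}
for name, dist in candidates.items():
    params = dist.fit(data)  # maximum-likelihood parameter estimates
    ks_stat, _ = stats.kstest(data, name, args=params)
    results[name] = ks_stat  # smaller = closer to the empirical CDF

# Print candidates from closest to farthest.
for name, ks_stat in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:8s} KS statistic = {ks_stat:.4f}")
```

The KS statistic is used here only as a distance for ranking. Its p-value is not reliable when the parameters were estimated from the same data, so a low distance suggests a plausible candidate rather than proving the data come from that family.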

Chiel
4

Definitely a stats question - sounds like you're trying to do a probability test of whether the distribution is significantly similar to the normal, lognormal, binomial, etc. distributions. The easiest is to test for normal or lognormal as explained below.

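Set your P-value cutoff first. Usually, if the P value is <= 0.05, you reject the null hypothesis of normality, i.e. you treat the data as NOT normally distributed. A larger P value does not prove normality; it only means the test found no evidence against it.

In Python, use SciPy. You only need the P value to run the test, which is the second of the 2 values this function returns (I'm ignoring the optional inputs here for clarity):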

import scipy.stats

# shapiro is part of the public scipy.stats namespace
W, Pvalue = scipy.stats.shapiro(x)

Perform the Shapiro-Wilk test for normality. The Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution.

If you want to see if it is lognormally distributed (provided it doesn't pass the P test above), you can try:

import numpy

# the log is only defined for strictly positive values of x
W, Pvalue = scipy.stats.shapiro(numpy.log(x))

Interpret the same way - I just tested on a known lognormally distributed simulation and got a 0.17 Pvalue on the np.log(x) test, and a number close to 0 for the standard shapiro(x) test. That tells me lognormally distributed is the better choice, normally distributed fails miserably.
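
For reference, here is a small self-contained version of that check on simulated data. The sample size, seed and the 0.05 cutoff are assumptions for illustration, and the P values you get will differ from the 0.17 quoted above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Assumption: simulated data that is lognormal by construction.
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)

alpha = 0.05  # P-value cutoff

_, p_raw = stats.shapiro(x)          # H0: x is normally distributed
_, p_log = stats.shapiro(np.log(x))  # H0: log(x) is normally distributed

print(f"raw data:  p = {p_raw:.3g}, normality rejected: {p_raw <= alpha}")
print(f"log(data): p = {p_log:.3g}, normality rejected: {p_log <= alpha}")
```

Note that the SciPy documentation warns that the Shapiro-Wilk P value may not be accurate for more than about 5000 samples, so for a dataset with a million rows you would probably want to test a random subsample.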

I kept it simple, which is what I gathered you are looking for. For other distributions, you may need to do more work along the lines of Q-Q plots https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot rather than simply following the few tests I proposed. A Q-Q plot plots the quantiles of your data against the quantiles of the distribution you are trying to fit. Here's a quick example that can get you down that path if you so desire:

import numpy as np
import pylab
import scipy.stats as stats

# Placeholder sample: replace with whatever data you are looking to fit to a distribution
mydata = np.random.lognormal(size=1000)
stats.probplot(mydata, dist='norm', plot=pylab)
pylab.show()

Above, you can replace dist='norm' with any distribution from the SciPy library http://docs.scipy.org/doc/scipy/reference/tutorial/stats/continuous.html#continuous-distributions-in-scipy-stats. Find its SciPy name and add the shape parameters required by the documentation, such as stats.probplot(mydata, dist='loggamma', sparams=(1,1), plot=pylab) or, for Student's t, stats.probplot(mydata, dist='t', sparams=(1,), plot=pylab). Then look at the plot and see how closely your data follow that distribution. If the points lie close to the straight line, that distribution is a plausible fit. The graph also shows an R^2 value; the closer to 1, the better the fit, generally.
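
If you want the numbers without drawing a plot, a minimal sketch is to call `probplot` without the `plot` argument and compare the correlation coefficient it returns for several candidates. The simulated exponential sample and the two candidate names below are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Assumption: simulated exponential data standing in for your own sample.
mydata = rng.exponential(scale=2.0, size=1000)

r_values = {}
for name in ("norm", "expon"):
    # With the default fit=True and no plot argument, probplot returns
    # ((theoretical quantiles, ordered data), (slope, intercept, r)).
    (_, _), (slope, intercept, r) = stats.probplot(mydata, dist=name)
    r_values[name] = r
    print(f"{name:6s} r = {r:.4f}, R^2 = {r**2:.4f}")
```

A higher r only says the points are closer to a straight line for that candidate; it is a screening aid, not a formal test.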

And if you want to continue trying to do what you're doing with the dataframe, try changing to: plt.hist(df['Frequency'].values)

Please vote for this answer if it answers your question :) Need some bounty to get replies on my own programming dilemmas.

Matt
  • Matt, thank you so much for this wonderfully detailed explanation. I was looking for details which I couldn't find while googling. It might be a stats problem or a programming problem, but all in all it's very important for people who are just now entering data science and experimenting on their own. Very much appreciated. Thanks again. – Sitz Blogz Apr 03 '16 at 11:00
3

Did you try using the seaborn library? They have a nice kernel density estimation function. Try:

import seaborn as sns
sns.kdeplot(df['Frequency'])

You can find installation instructions here

Greg Friedman
  • I have worked with seaborn but haven't checked that; I definitely will. My data happens to be discrete in nature, so will it be applicable? – Sitz Blogz Mar 28 '16 at 12:14
  • I did try to implement the solution you provided, and it looks good to me. But when I try the distribution plots in seaborn, the KDE plots fine, whereas when I switch to hist it goes into an infinite loop and won't return anything. Any suggestions about that? – Sitz Blogz Apr 01 '16 at 05:15
  • @SitzBlogz If you have an issue with some code and you want help to debug it add it to your question or maybe better ask another question. – Stop harming Monica Apr 01 '16 at 07:29
2

The only distribution the data carry within itself is the empirical distribution. If you have the data as a 1-D NumPy array data, you can compute the value of the empirical distribution function at x as the cumulative relative frequency of the values less than or equal to x:

data[data <= x].size / data.size

This is a step function so it does not have an associated probability density function but a probability mass function where the mass of each observed value is its relative frequency. To compute the relative frequencies:

values, freqs = np.unique(data, return_counts=True)
rfreqs = freqs / data.size

This does not mean that the data is a random sample from their empirical distribution. If you want to know what distribution your data are a sample from (if any) just by looking at the data, the answer is you can't. But that is more about statistics than about programming.
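
The two snippets above can be combined into a small runnable sketch that gives the probability mass function and the empirical distribution function at every observed value. The array holds the four sample frequencies from the question; the variable names `pmf` and `ecdf` are only illustrative.

```python
import numpy as np

# The four sample frequencies shown in the question.
data = np.array([45, 4, 6, 1])

# Distinct observed values (sorted) and how often each occurs.
values, counts = np.unique(data, return_counts=True)

pmf = counts / data.size  # relative frequency of each observed value
ecdf = np.cumsum(pmf)     # empirical distribution function at each value

for v, p, F in zip(values, pmf, ecdf):
    print(f"x = {v:2d}  P(X = x) = {p:.2f}  F(x) = {F:.2f}")
```

Because the data are discrete, `ecdf` is a step function, which is why a plotted CDF of such counts shows steps rather than a smooth curve.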

Stop harming Monica
  • Thank you for the detailed explanation. My data is real-world keywords scraped from Twitter, and the frequency of keywords looks discrete to me. When I plot the CDFs, they show steps rather than a curve. – Sitz Blogz Mar 30 '16 at 19:39
2

The histogram does not do what you think it does; you are trying to show a bar graph. The histogram needs each data point separately in a list, not the frequency itself. You have [3,2,0,4,...] but should have [1,1,1,2,2,4,4,4,4]. You cannot determine a probability distribution automatically.

  • My data is discrete in nature, hence I am not able to plot a histogram. But I would seriously appreciate it if someone could help with proper steps or maybe code. With the 5 answers above I am seriously confused. – Sitz Blogz Mar 31 '16 at 19:26
  • @SitzBlogz You already plotted a histogram. It shows most of the mass concentrated on the left and a very long tail on the right. Using more bins could provide better insight. – Stop harming Monica Apr 01 '16 at 07:52
  • @Goyo I think this is what I was looking for someone to tell me what I plotted and now I know thank you so very much :) – Sitz Blogz Apr 01 '16 at 07:58
  • It would be nice to point out that you've just copied @Kobbe's comment as an answer. Also, I don't think this is correct. The OP wants to know the distribution ***of frequencies***: if there are three tags each with a frequency of 1000, then the histogram should show *3*, **not** the sum (3000). – DilithiumMatrix Apr 02 '16 at 02:37
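
To illustrate the points raised in the comments above (histogram the frequency values themselves, and use more bins), here is a hedged sketch. The Zipf-distributed `freq` array is an assumption standing in for `df['Frequency']`, and log-spaced bins are one possible choice for heavy-tailed counts, not something proposed in the thread.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumption: synthetic heavy-tailed counts in 1..20000 standing in for df['Frequency'].
freq = np.minimum(rng.zipf(a=2.0, size=100_000), 20_000)

# 10 equal-width bins (the matplotlib default): almost everything lands in the first bin.
lin_counts, lin_edges = np.histogram(freq, bins=10)

# Log-spaced edges from just below 1 to just above the maximum keep every value
# inside the range and spread the many small counts over several bins.
log_edges = np.geomspace(0.5, freq.max() + 1, num=30)
log_counts, _ = np.histogram(freq, bins=log_edges)

print("share of data in first linear bin:", lin_counts[0] / freq.size)
print("non-empty log-spaced bins:", np.count_nonzero(log_counts))
```

With integer counts, some of the narrowest log-spaced bins contain no integer and stay empty. To draw the result you could pass `bins=log_edges` to `plt.hist` and switch both axes to a log scale.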
1

I think you are asking a slightly different question:

What is the correlation between my raw data and the curve to which I have mapped it?

This is a conceptual problem, and you are trying to understand the meanings of the values R and R squared. Start by working through this MiniTab blog post. You may want to skim this non-Python KaleidaGraph guide to understand the classes of curves to fit and the use of least squares in fitting the curves.
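
As a rough illustration of what R squared measures when curves are fitted by least squares, the sketch below fits polynomials of increasing degree to synthetic data and computes R squared for each. The data, the chosen degrees and the helper `r_squared` are assumptions made for this example; they do not come from the linked guides.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumption: synthetic data, a quadratic trend plus Gaussian noise.
x = np.linspace(0, 10, 50)
y = 0.5 * x**2 - x + 2 + rng.normal(scale=1.0, size=x.size)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


scores = {}
for degree in (1, 2, 3):
    coeffs = np.polyfit(x, y, degree)  # least-squares polynomial fit
    scores[degree] = r_squared(y, np.polyval(coeffs, x))
    print(f"degree {degree}: R^2 = {scores[degree]:.4f}")
```

On the data used for fitting, R squared cannot decrease when the polynomial degree grows, so the highest value alone does not identify the right type of curve.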

You were probably downvoted because it is a math question more than a programming question.

Charles Merriam
  • Thank you for answering. This cleared up more of my doubts. I agree it was maybe more of a stats question than a programming one. But I am trying to understand the distribution of real-life discrete data. – Sitz Blogz Mar 31 '16 at 04:51
  • For one set of real data, draw scatter plot and look at it. For automating it, use your favorite curve fit algorithm for each type, e.g., different polynomials and calculate the correlations. Show the type of curve with the highest correlation. – Charles Merriam Apr 01 '16 at 03:15
1

I may be missing something, but it seems that a major point is being overlooked across the board: the data set you are describing is a categorical data set. That is, the x-values are not numeric, they're just words (#Car, #photo, etc.). The concept of the shape of a probability distribution is meaningless for a categorical data set, since there is no logical ordering for the categories. What would a histogram even look like? Would #Car be the first bin? Or would it be all the way to the right of your graph? Unless you have some criterion for quantifying your categories, trying to make judgments based on the shape of the distribution is meaningless.

Here's a small text-based example to clarify what I'm saying. Suppose I survey a group of people and ask their favorite color. I plot the results:

   Red | ##
 Green | #####
  Blue | #######
Yellow | #####
Orange | ##

Huh, looks like color preferences are normally distributed. Wait, what if I had randomly put the colors in a different order in my graph:

  Blue | #######
Yellow | #####
 Green | #####
Orange | ##
   Red | ##

I guess the data is actually positively skewed? Not so, of course: for a categorical data set the shape of the distribution is meaningless. Only if you were to decide to somehow quantify each hashtag in your data set would the problem have meaning. Do you want to compare the length of a hashtag to its frequency? Or the alphabetical ordering of a hashtag to its frequency? Etc.

Christian
  • The data is discrete and not continuous, as any real-world data would be, and I have to identify whether it is skewed, normal, or some other kind of distribution. It has nothing to do with what we have in column one, which can be any random keyword, name, or anything else. All that matters is column two, where we have the counts of those keywords; I want to put those counts into some function to find the probability distribution function, and in future to classify the data into some category. If I am not wrong, you are suggesting that I sort the data and plot a histogram. I did try that too, and yet I am not able to get it. – Sitz Blogz Apr 01 '16 at 06:34
  • Consider this as data from Instagram, where we find a great many #keywords, so I am trying to identify the PDF of those keywords, maybe over a duration of one day or so. When I plot the CDFs I get steps instead of a smooth curve. – Sitz Blogz Apr 01 '16 at 07:14
  • @SitzBlogz You can't *find* any distribution in your data other than the empirical one I mentioned in my answer. You can make the hypothesis that your data are random variates of some other distribution and test it, but that's a different thing. – Stop harming Monica Apr 01 '16 at 07:24
  • @Goyo I did read about your suggestion regarding empirical and it is pretty convincing but again if you see here with all these suggestions I am a bit confused. – Sitz Blogz Apr 01 '16 at 07:29
  • @SitzBlogz You are getting confusing answers because your question is a bit confusing. So you want "to know which distribution does the data carry within itself"? See my answer. Do you want to find a known distribution that fits your data? It is not clear to me, though some people here interpreted it that way. Do you have issues computing/plotting histograms, KDEs...? There are some hints that this might be the case, but they are quite vague. The answers can only be as good as the question, and there is room for improvement in it. – Stop harming Monica Apr 01 '16 at 08:10
  • I must be misunderstanding then - are you looking at the distribution of strictly the frequency counts? So in the example you provided, do you just care about the distribution underlying the set [45, 4, 6, 1], irrespective of the associated hashtag? – Christian Apr 01 '16 at 14:02
  • Yes. Column one can be any type of keyword, random names, a time series, or geolocation. Column two is the count of how many times the respective keyword was posted in one day, one hour, or a few days. – Sitz Blogz Apr 01 '16 at 17:00