111

How would you create a qq-plot using Python?

Assume that you have a large set of measurements and are using some plotting function that takes XY-values as input. The function should plot the quantiles of the measurements against the corresponding quantiles of some distribution (normal, uniform...).

The resulting plot then lets us evaluate whether our measurements follow the assumed distribution or not.

http://en.wikipedia.org/wiki/Quantile-quantile_plot

Both R and Matlab provide ready-made functions for this, but I am wondering what the cleanest method for implementing it in Python would be.

John
  • 2
    Have you looked at `probplot`? http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html – Geoff Dec 13 '12 at 17:58
  • 1
    qqplot and probplots with lots of options: http://statsmodels.sourceforge.net/devel/graphics.html#goodness-of-fit-plots – Josef Dec 13 '12 at 19:00

9 Answers

129

Update: As folks have pointed out this answer is not correct. A probplot is different from a quantile-quantile plot. Please see those comments and other answers before you make an error in interpreting or conveying your distributions' relationship.

I think that scipy.stats.probplot will do what you want. See the documentation for more detail.

import numpy as np 
import pylab 
import scipy.stats as stats

measurements = np.random.normal(loc = 20, scale = 5, size=100)   
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()

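Since the question asks for XY-values to feed into an arbitrary plotting function, here is a minimal sketch of calling probplot without the plot argument, in which case it only returns the numbers. The variable names (theoretical_q, ordered_values) and the fixed seed are my own choices for illustration.

```python
import numpy as np
import scipy.stats as stats

# Seeded generator so the illustration is reproducible (an assumption, not required)
rng = np.random.default_rng(0)
measurements = rng.normal(loc=20, scale=5, size=100)

# Without plot=..., probplot computes the values and draws nothing.
# First tuple: theoretical quantiles (x) and the sorted sample (y).
# Second tuple: least-squares fit of y on x and its correlation coefficient.
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
    measurements, dist="norm"
)

# theoretical_q and ordered_values can now be passed to any XY plotting function.
```

For normally distributed data, slope and intercept land close to the sample's standard deviation and mean, which is why the fitted line is a handy visual reference.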

Geoff
  • Sometimes I have seen dotted confidence lines that narrow in the middle and flare out like a trumpet at the ends. Can you add these "guide lines" to the plot? – Norfeldt Aug 13 '13 at 14:21
  • 25
    Ok, but this is a probability plot (a sample vs a theoretical distribution). A qq plot compares two samples. http://www.itl.nist.gov/div898/handbook/eda/section3/qqplot.htm http://www.itl.nist.gov/div898/handbook/eda/section3/probplot.htm – Ricky Robinson Apr 15 '14 at 09:06
  • 7
    @RickyRobinson It seems that many sources (including Wikipedia) contradict the NIST handbook. Pretty much any other source states that a QQ plot has theoretical quantiles on the horizontal axis, and data quantiles vertically. In any case, the distinction is academic: plotting a sample is essentially the same as using the empirical distribution function. Either way, you're plotting one distribution's quantiles against another. – Peter Aug 02 '15 at 15:35
  • 1
    I agree with @RickyRobinson, this is not the correct answer to this question. QQ plots and prob plots are different even though they both plot one distribution's quantiles against another. – Florent Dec 17 '18 at 18:29
  • From the documentation: "probplot generates a probability plot, which should not be confused with a Q-Q or a P-P plot." – ady May 06 '21 at 13:13
68

Using qqplot of statsmodels.api is another option:

Very basic example:

import numpy as np
import statsmodels.api as sm
import pylab

test = np.random.normal(0,1, 1000)

sm.qqplot(test, line='45')
pylab.show()


Documentation and more examples are here

Akavall
  • 2
    @tommy.carstensen it was deliberately separated from `scipy` to `statsmodels` – SARose Jan 27 '17 at 20:09
  • 14
    Just a note. Your example draws the line for standard normal distribution. To get a standardized line (scaled by the standard deviation of the given sample and have the mean added) like in @Geoff example, you need to set line='s' instead of line='45' – Mike May 16 '17 at 09:41
  • +1 for this answer. I think it is important to focus more resources on a single package for statistics. `statsmodels` would be a good choice. – Ken T Apr 29 '18 at 14:47
23

If you need a QQ plot of one sample against another, statsmodels includes qqplot_2samples(). Like Ricky Robinson in a comment above, I think of a QQ plot as comparing two samples, whereas a probability plot compares a sample against a theoretical distribution.

http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.gofplots.qqplot_2samples.html
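If the two samples have different sizes, one common approach is to evaluate both at the same set of probabilities. Below is a minimal NumPy-only sketch of that idea. The helper name qq_two_samples and the choice of plotting positions (i - 0.5) / n are my own assumptions, not part of statsmodels.

```python
import numpy as np

def qq_two_samples(a, b, n_quantiles=None):
    """Return matching quantiles of two samples (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if n_quantiles is None:
        # Use as many quantiles as the smaller sample has points
        n_quantiles = min(a.size, b.size)
    # Evenly spaced probabilities strictly inside (0, 1)
    probs = (np.arange(1, n_quantiles + 1) - 0.5) / n_quantiles
    # np.quantile interpolates, so the sample sizes do not need to match
    return np.quantile(a, probs), np.quantile(b, probs)

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = rng.normal(loc=1.0, size=2000)

# qx and qy have equal length and can be passed to any XY plotting function
qx, qy = qq_two_samples(x, y)
```

If both samples come from the same distribution, the points (qx, qy) fall near the line y = x. A shift in location, as in this example, moves them onto a parallel line.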

ccap
  • 12
    This qqplot implementation does not seem to handle samples with different sizes, which is funny because one of the big advantages of a Q-Q plot is that one can compare samples with different sizes... – Robert Muil Sep 15 '14 at 10:42
7

I came up with this. Maybe you can improve it. Especially the method of generating the quantiles of the distribution seems cumbersome to me.

You could replace np.random.normal with any other distribution from np.random to compare data against other distributions.

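#!/usr/bin/env python

import numpy as np

measurements = np.random.normal(loc=20, scale=5, size=100000)

def qq_plot(data, sample_size):
    # Column 0: sorted random subset of the data
    # Column 1: sorted random draws from a standard normal distribution
    qq = np.ones([sample_size, 2])
    np.random.shuffle(data)  # shuffles in place so the slice below is a random subset
    qq[:, 0] = np.sort(data[0:sample_size])
    qq[:, 1] = np.sort(np.random.normal(size=sample_size))
    return qq

print(qq_plot(measurements, 1000))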
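Instead of drawing random normal variates, the theoretical quantiles can be computed directly from the inverse CDF, which removes the sampling noise on that axis and uses all of the data. This is only a sketch: the function name qq_values and the plotting positions (i - 0.5) / n are assumptions on my part, and note that the column order here is theoretical quantiles first, data second.

```python
import numpy as np
from scipy import stats

def qq_values(data, dist=stats.norm):
    """Return an (n, 2) array: theoretical quantiles, sorted data (hypothetical helper)."""
    data = np.sort(np.asarray(data, dtype=float))
    n = data.size
    # Plotting positions strictly inside (0, 1), so ppf never returns +/- infinity
    probs = (np.arange(1, n + 1) - 0.5) / n
    # ppf is the inverse CDF, i.e. the quantile function of the distribution
    return np.column_stack([dist.ppf(probs), data])

rng = np.random.default_rng(0)
measurements = rng.normal(loc=20, scale=5, size=1000)

qq = qq_values(measurements)
# qq[:, 0] is the x-axis (theoretical), qq[:, 1] the y-axis (measurements)
```

Any frozen or unfrozen scipy.stats distribution with a ppf method, for example stats.uniform, can be passed as dist to compare against a different distribution.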
John
  • Why do you select a random sample_size subset from data and not compare random variates from your theoretical distribution of the entire size of measurements against all measurements? – pas-calc Jul 12 '23 at 21:33
4

To add to the confusion around Q-Q plots and probability plots in the Python and R worlds, this is what the SciPy manual says:

"probplot generates a probability plot, which should not be confused with a Q-Q or a P-P plot. Statsmodels has more extensive functionality of this type, see statsmodels.api.ProbPlot."

If you try out scipy.stats.probplot, you'll see that indeed it compares a dataset to a theoretical distribution. Q-Q plots, OTOH, compare two datasets (samples).

R has functions qqnorm, qqplot and qqline. From the R help (Version 3.6.3):

qqnorm is a generic function the default method of which produces a normal QQ plot of the values in y. qqline adds a line to a “theoretical”, by default normal, quantile-quantile plot which passes through the probs quantiles, by default the first and third quartiles.

qqplot produces a QQ plot of two datasets.

In short, R's qqnorm offers the same functionality that scipy.stats.probplot provides with the default setting dist="norm". But the fact that they called it qqnorm and that it's supposed to "produce a normal QQ plot" may easily confuse users.

Finally, a word of warning. These plots don't replace proper statistical testing and should be used for illustrative purposes only.

András Aszódi
3

It exists now in the statsmodels package:

http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.gofplots.qqplot.html

grasshopper
2

You can use bokeh

from bokeh.plotting import figure, show
from scipy.stats import probplot
# pd_series is the series you want to plot
series1 = probplot(pd_series, dist="norm")
p1 = figure(title="Normal QQ-Plot", background_fill_color="#E8DDCB")
p1.scatter(series1[0][0],series1[0][1], fill_color="red")
show(p1)
sushmit
2
import numpy as np 
import pylab 
import scipy.stats as stats
measurements = np.random.normal(loc = 20, scale = 5, size=100)   
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()

Here probplot draws the measurements against the normal distribution, which is specified by dist="norm".

Ravi
1

How big is your sample? Here is another option to test your data against any distribution, using the OpenTURNS library. In the example below, I generate a sample x of 1,000,000 numbers from a uniform distribution and test it against a normal distribution. You can replace x with your own data if you reshape it as x = [[x1], [x2], ..., [xn]].

import openturns as ot

x = ot.Uniform().getSample(1000000)
g = ot.VisualTest.DrawQQplot(x, ot.Normal())
g

In a Jupyter Notebook, evaluating g on the last line of the cell displays the plot.

If you are writing a script, you can display the plot explicitly:

from openturns.viewer import View
import matplotlib.pyplot as plt
View(g)
plt.show()
Jean A.