How to use Python to draw a normal probability plot by using certain column data in dataFrame

Question

I have a DataFrame that contains two columns, "thousands of dollars per year" and "EMPLOY".

I created a new column in this DataFrame named "cubic_Root", computed from df['thousands of dollars per year'] as the negative reciprocal of its cube root:

df['cubic_Root'] = -1 / df['thousands of dollars per year'] ** (1. / 3)

The data in df['cubic_Root'] looks like this:

ID  cubic_Root
1   -0.629961
2   -0.405480
3   -0.329317
4   -0.480750
5   -0.305711
6   -0.449644
7   -0.449644
8   -0.480750

How can I draw a normal probability plot using the data in df['cubic_Root']?

Check out this: https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.probplot.html — juanpa.arrivillaga, Sep 09 '17 at 04:27

score 6 · Accepted Answer · answered Sep 09 '17 at 05:05

You want the "Probability" Plots.

For a single plot, you'd have something like the code below, which plots the binned cumulative frequency of the data:

import scipy.stats
import numpy as np
import matplotlib.pyplot as plt

# 100 values from a normal distribution with a std of 3 and a mean of 0.5
data = 3.0 * np.random.randn(100) + 0.5

counts, start, dx, _ = scipy.stats.cumfreq(data, numbins=20)
x = np.arange(counts.size) * dx + start

plt.plot(x, counts, 'ro')
plt.xlabel('Value')
plt.ylabel('Cumulative Frequency')

plt.show()
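
For the normal probability plot the question asks about (ordered sample values against theoretical normal quantiles), scipy.stats.probplot, linked in the comment above, can draw it directly. This is a minimal sketch, assuming the eight cubic_Root values shown in the question are the whole column; the output file name is only a placeholder:

```python
import pandas as pd
import scipy.stats
import matplotlib.pyplot as plt

# Assumption: these are the eight cubic_Root values listed in the question
df = pd.DataFrame({
    "cubic_Root": [-0.629961, -0.405480, -0.329317, -0.480750,
                   -0.305711, -0.449644, -0.449644, -0.480750]
})

fig, ax = plt.subplots()

# probplot sorts the sample, pairs it with theoretical normal quantiles,
# and draws the points plus a least-squares reference line on ax
(osm, osr), (slope, intercept, r) = scipy.stats.probplot(
    df["cubic_Root"].to_numpy(), dist="norm", plot=ax
)

ax.set_title("Normal probability plot of cubic_Root")
fig.savefig("cubic_root_probplot.png")  # placeholder file name
```

probplot puts the theoretical quantiles on the x-axis and the ordered data on the y-axis; points that fall close to the fitted line suggest the data are roughly normal. In an interactive session, plt.show() can replace the savefig call.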

If you know the distribution you want to plot, define it as a function and plot it like so:

import numpy as np
from matplotlib import pyplot as plt

def my_dist(x):
    return np.exp(-x ** 2)

x = np.linspace(-5, 5, 200)  # fine grid; exp(-x**2) is essentially 0 outside this range
p = my_dist(x)
plt.plot(x, p)
plt.show()

If you don't have the exact distribution as an analytical function, you can generate a large sample, take a histogram, and smooth it with a spline:

import numpy as np
from scipy.interpolate import UnivariateSpline
from matplotlib import pyplot as plt

N = 1000
n = N // 10                      # integer division: np.histogram needs an integer bin count
s = np.random.normal(size=N)     # generate your data sample with N elements
p, x = np.histogram(s, bins=n)   # bin it into n = N // 10 bins
x = x[:-1] + (x[1] - x[0]) / 2   # convert bin edges to centers
f = UnivariateSpline(x, p, s=n)  # s is the smoothing factor
plt.plot(x, f(x))
plt.show()

You can increase or decrease s (the smoothing factor) in the UnivariateSpline call to increase or decrease the smoothing.
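
As a rough illustration of what the smoothing factor does, the sketch below fits the same histogram with three values of s and reports how many knots the spline ends up with and how large the remaining squared residual is. The seed and the particular s values are arbitrary choices for the example, not recommendations:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)           # arbitrary seed, for repeatability
sample = rng.normal(size=1000)
counts, edges = np.histogram(sample, bins=100)
centers = (edges[:-1] + edges[1:]) / 2   # bin centers

results = []
for s in (0, 100, 1000):                 # s=0 interpolates every point
    spline = UnivariateSpline(centers, counts, s=s)
    n_knots = len(spline.get_knots())
    residual = spline.get_residual()     # sum of squared residuals of the fit
    results.append((s, n_knots, residual))
    print(f"s={s}: {n_knots} knots, residual={residual:.1f}")
```

A larger s lets the fit stop with fewer knots, which gives a smoother curve that follows the histogram less closely.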

To plot the probability density function (PDF) of the inter-arrival time of events, first generate some sample data:

import numpy as np
import scipy.stats

# generate data samples
data = scipy.stats.expon.rvs(loc=0, scale=1, size=1000, random_state=123)

A kernel density estimate can then be obtained by calling

scipy.stats.gaussian_kde(data,bw_method=bw)

where bw is an optional parameter of the estimation procedure. For this data set, the code below plots the fit for three values of bw:

# test values for the bw_method option ('None' is the default value)
bw_values =  [None, 0.1, 0.01]

# generate a list of kde estimators for each bw
kde = [scipy.stats.gaussian_kde(data,bw_method=bw) for bw in bw_values]


# plot (normalized) histogram of the data
# (density=True replaces the normed argument, which newer Matplotlib versions no longer accept)
import matplotlib.pyplot as plt
plt.hist(data, 50, density=True, facecolor='green', alpha=0.5)

# plot density estimates
t_range = np.linspace(-2,8,200)
for i, bw in enumerate(bw_values):
    plt.plot(t_range,kde[i](t_range),lw=2, label='bw = '+str(bw))
plt.xlim(-1,6)
plt.legend(loc='best')
plt.show()
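
To see what bw_method controls without looking at a plot, the sketch below prints the factor attribute of each estimator and checks that each density integrates to 1. When bw_method is a scalar, gaussian_kde uses it directly as the factor that scales the sample standard deviation; None falls back to Scott's rule. The variable names here are chosen only for this example:

```python
import numpy as np
import scipy.stats

# same sample as above
data = scipy.stats.expon.rvs(loc=0, scale=1, size=1000, random_state=123)

summary = []
for bw in (None, 0.1, 0.01):
    kde = scipy.stats.gaussian_kde(data, bw_method=bw)
    # total probability mass of the estimate over the whole real line
    total = kde.integrate_box_1d(-np.inf, np.inf)
    summary.append((bw, kde.factor, total))
    print(f"bw_method={bw}: factor={kde.factor:.4f}, integral={total:.6f}")
```

A smaller factor means narrower kernels, so the estimate follows individual samples more closely and looks noisier, while a larger factor smooths more.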

Reference:

Python: Matplotlib - probability plot for several data set

how to plot Probability density Function (PDF) of inter-arrival time of events?

@PulkitKedia please refer to https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.cumfreq.html — Tushar Gupta, Mar 25 '18 at 10:39
Normal probability plots are plotted with Z-scores on the Y-axis, but here binned values of cumulative frequency are used (referring to the single plot). Why is that? — Pulkit Kedia, Apr 01 '18 at 16:39
