Calculate and plot 95% range of data on scatter plot in Python

Question

I wish to know, for a given predicted commute journey duration in minutes, the range of actual commute times I might expect. For example, if Google Maps predicts my commute to be 20 minutes, what is the minimum and maximum commute I should expect (perhaps a 95% range)?

Let's import my data into pandas:

%matplotlib inline
import pandas as pd

commutes = pd.read_csv('https://raw.githubusercontent.com/blokeley/commutes/master/commutes.csv')
commutes.tail()

This displays the last few rows of the data.

We can easily create a plot that shows the scatter of raw data, a linear regression line, and the 95% confidence interval on that line:

import seaborn as sns

# Create a linear model plot
sns.lmplot(x='prediction', y='duration', data=commutes);

How do I now calculate and plot the 95% range of actual commute times versus predicted times?

Put another way, if Google Maps predicts my commute to take 20 minutes, it looks like it could actually take anywhere between roughly 14 and 28 minutes. It would be great to calculate or plot this range.

Thanks in advance for any help.

blokeley · Accepted Answer · 2017-03-17T12:02:22.130

The relationship between the actual duration of a commute and the prediction should be linear, so I can use quantile regression, which fits a separate line for each quantile of the actual duration:

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Import data
commutes = pd.read_csv('https://raw.githubusercontent.com/blokeley/commutes/master/commutes.csv')

# Create the quantile regression model
model = smf.quantreg('duration ~ prediction', commutes)

# Create a list of quantiles to calculate
quantiles = [0.05, 0.25, 0.50, 0.75, 0.95]

# Create a list of fits
fits = [model.fit(q=q) for q in quantiles]

# Create a new figure and axes
figure, axes = plt.subplots()

# Plot the scatter of data points
x = commutes['prediction']
axes.scatter(x, commutes['duration'], alpha=0.4)

# Create an array of predicted durations spanning the observed range, used to draw the lines
_x = np.linspace(x.min(), x.max())

for quantile, fit in zip(quantiles, fits):
    # Plot the fitted line for this quantile
    _y = fit.params['prediction'] * _x + fit.params['Intercept']
    axes.plot(_x, _y, label=quantile)

# Plot the line of perfect prediction
axes.plot(_x, _x, 'g--', label='Perfect prediction')
axes.legend()
axes.set_xlabel('Predicted duration (minutes)')
axes.set_ylabel('Actual duration (minutes)');

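This gives a scatter plot of the data overlaid with the five quantile lines and the line of perfect prediction.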

Many thanks to my colleague Philip for the quantile regression tip.
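Note that the 0.05 and 0.95 lines bracket a 90% range. For a 95% range, the same approach could be used with the 0.025 and 0.975 quantiles. The following is a minimal sketch, assuming synthetic data in place of the commutes CSV; the generated numbers and the `predicted_range` helper are illustrative and not part of the original analysis:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the commutes data (assumption: NOT the real CSV)
rng = np.random.default_rng(0)
prediction = rng.uniform(15, 35, size=500)
# Right-skewed noise, an arbitrary choice for illustration
duration = prediction + rng.gamma(shape=2.0, scale=2.0, size=500) - 2.0
commutes = pd.DataFrame({'prediction': prediction, 'duration': duration})

model = smf.quantreg('duration ~ prediction', commutes)


def predicted_range(model, predicted_minutes, lower_q=0.025, upper_q=0.975):
    """Return (lower, upper) actual durations for one predicted duration.

    Hypothetical helper: fits one quantile regression per bound and
    evaluates the fitted line at the given prediction.
    """
    bounds = []
    for q in (lower_q, upper_q):
        params = model.fit(q=q).params
        bounds.append(params['Intercept'] + params['prediction'] * predicted_minutes)
    return tuple(bounds)


low, high = predicted_range(model, 20)
print(f'95% range for a 20 minute prediction: {low:.1f} to {high:.1f} minutes')
```

With the real data, `commutes` would come from the `pd.read_csv` call shown earlier, and the two extra quantiles could simply be added to the `quantiles` list to plot them.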

score -1 · Answer 2 · answered Mar 01 '17 at 17:56


You should fit your data to a Gaussian distribution. The range within 2 standard deviations of the mean covers about 95% of the results (3 standard deviations covers about 99.7%).

Look up the normal distribution.
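For reference, the share of a normal distribution that lies within a given number of standard deviations can be checked with the standard library. This is a small sketch (the `coverage` helper is illustrative), and it only applies if the data really are normally distributed:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f'within {k} sigma: {coverage(k):.4f}')

# Number of standard deviations either side of the mean that covers 95%
z_95 = standard_normal.inv_cdf(0.975)
print(f'95% coverage needs about {z_95:.2f} sigma')
```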

Pedro

I know how to do this for 1-dimensional data, but what about the 2-dimensional data that I have? To draw the distinction, how do I know the range of actual values if my predicted value is not an integer? – blokeley Mar 01 '17 at 21:33
Also, the data do not fit a Gaussian (normal) distribution, so the standard deviation does not tell me much. I think that this is something to do with calculating percentiles. – blokeley Mar 02 '17 at 07:41
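One way to read the percentile idea in the comment above, without assuming any distribution, is to group the actual durations by predicted duration and take empirical percentiles within each group. This is a rough sketch of that binning approach (not the quantile regression used in the accepted answer), assuming synthetic data and a hypothetical `percentile_bands` helper; it needs plenty of observations per group to estimate the tails:

```python
import random
from collections import defaultdict
from statistics import quantiles


def percentile_bands(predictions, durations, bin_width=5, min_count=40):
    """Map each prediction bin start to the (2.5th, 97.5th) percentiles
    of the actual durations that fall in that bin."""
    groups = defaultdict(list)
    for predicted, actual in zip(predictions, durations):
        bin_start = int(predicted // bin_width) * bin_width
        groups[bin_start].append(actual)

    bands = {}
    for bin_start, values in sorted(groups.items()):
        if len(values) < min_count:
            continue  # too few points to estimate the tails
        cuts = quantiles(values, n=40)  # 39 cut points, 2.5% apart
        bands[bin_start] = (cuts[0], cuts[-1])
    return bands


# Synthetic stand-in for the commutes data (assumption: NOT the real CSV)
random.seed(0)
predictions = [random.uniform(15, 35) for _ in range(2000)]
durations = [p + random.gammavariate(2.0, 2.0) - 2.0 for p in predictions]

bin_width = 5
bands = percentile_bands(predictions, durations, bin_width=bin_width)
for bin_start, (low, high) in bands.items():
    print(f'{bin_start}-{bin_start + bin_width} min predicted: '
          f'{low:.1f} to {high:.1f} min actual')
```

Binning handles non-integer predictions by assigning each one to a range, at the cost of a stepped rather than smooth band.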
