How do I get both lower and upper 95% confidence (or prediction) interval columns for my predictions?

import pandas as pd

df1 = pd.DataFrame({
    'cumsum_days': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    'prediction': [800, 900, 1200, 700, 600,
                   550, 500, 650, 625, 600,
                   550, 525, 500, 400, 350]})

Desired dataframe looks something like this:

prediction  lower_ci   high_ci
800         some_num   some_num
900         some_num   some_num
1200        some_num   some_num
700         some_num   some_num

These functions each return only a single number, whereas I am looking for 95% confidence intervals for df1.prediction (15 data points apiece).

mean = df1.prediction.mean()
std = df1.prediction.std()

I've also tried the function below, but it returns only three values (the mean and its lower and upper bounds) instead of two additional arrays of confidence bands / intervals for my predicted values:

import numpy as np
import scipy.stats


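def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)  # convert to a float array
    n = len(a)
    # sample mean and standard error of the mean
    m, se = np.mean(a), scipy.stats.sem(a)
    # half-width of the two-sided t interval with n-1 degrees of freedom
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h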
Starbucks
  • Does this help you: https://stackoverflow.com/questions/15033511/compute-a-confidence-interval-from-sample-data? – Code Different Jan 23 '20 at 21:36
  • This only gives me three values, rather than 2 arrays of confidence intervals for my predicted values. – Starbucks Jan 23 '20 at 21:38
  • Pandas' `plot` has a `yerr` argument. Try: `df1.set_index('cumsum_days').pipe(lambda d: d.plot(yerr={**d.std()}))` – piRSquared Jan 23 '20 at 21:40
  • Maybe this points you in the right direction? https://stackoverflow.com/questions/53519823/confidence-interval-in-python-dataframe – Lapis Rose Jan 23 '20 at 21:43
  • @LapisRose, this produces confidence interval metrics for an entire array. I need two confidence bands (arrays) for my predicted values. – Starbucks Jan 23 '20 at 21:45
  • Then you need to go back to whatever did the prediction and get the standard errors and confidence intervals from there. Because at this point it's impossible. – ALollz Jan 23 '20 at 21:47
  • @ALollz, I used scipy's curve_fit, which I don't think has a confidence bands option. – Starbucks Jan 23 '20 at 21:58
  • @Starbucks get residuals from the training set and calculate standard deviation, then use standard normal quantiles to get CI (1.96 * std for 95%) – Marat Jan 24 '20 at 02:16
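
The residual-based approach from the last comment could look roughly like the sketch below. It is only an illustration under stated assumptions: the `model` function, the synthetic training data, and the starting values `p0` are all made up, since the question does not show the function or data that were passed to curve_fit. The bands it produces have constant width and ignore uncertainty in the fitted parameters, so treat them as an approximate prediction interval rather than an exact one.

```python
import numpy as np
import pandas as pd
from scipy import stats
from scipy.optimize import curve_fit


# Hypothetical model: the question does not show the function that was fitted.
def model(x, a, b):
    return a * np.exp(-b * x)


# Synthetic training data for illustration only (not taken from the question).
rng = np.random.default_rng(0)
x_train = np.arange(1, 16, dtype=float)
y_train = model(x_train, 1000.0, 0.07) + rng.normal(0.0, 50.0, size=x_train.size)

# Fit the model; p0 is an assumed starting guess.
popt, pcov = curve_fit(model, x_train, y_train, p0=(1000.0, 0.1))

# Residuals on the training set and their standard deviation,
# with degrees of freedom reduced by the number of fitted parameters.
residuals = y_train - model(x_train, *popt)
dof = x_train.size - len(popt)
resid_std = np.sqrt(np.sum(residuals ** 2) / dof)

# Standard normal quantile for a two-sided 95% interval (about 1.96).
z = stats.norm.ppf(0.975)

# One lower and one upper value per prediction.
bands = pd.DataFrame({
    "cumsum_days": x_train.astype(int),
    "prediction": model(x_train, *popt),
})
bands["lower_ci"] = bands["prediction"] - z * resid_std
bands["high_ci"] = bands["prediction"] + z * resid_std
```

With a real fit, `x_train` and `y_train` would be replaced by the data actually used for fitting, and `model` by the fitted function. For small samples, `stats.t.ppf(0.975, dof)` could be used in place of the normal quantile to get slightly wider bands.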

1 Answer

How about something like this?

bins = [0, 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3, 3.25, 3.5, 3.75, 4, 4.25, 4.5, 4.75, 5, 5.25, 5.5, 5.75, 6, 6.25, 6.5, 6.75, 7, 7.25, 7.5, 7.75, 8, 8.25, 8.5, 8.75, 9, 9.25, 9.5, 9.75, 10, np.inf]
labels = ['0', '1', '1.25', '1.5', '1.75', '2', '2.25', '2.5', '2.75', '3', '3.25', '3.5', '3.75', '4', '4.25', '4.5', '4.75', '5', '5.25', '5.5', '5.75', '6', '6.25', '6.5', '6.75', '7', '7.25', '7.5', '7.75', '8', '8.25', '8.5', '8.75', '9', '9.25', '9.5', '9.75', '10']

dataset['RatingScore'] = pd.cut(dataset['Rating'], bins=bins, labels=labels, right=True)

You can create your basic setup and then convert the final object into a dataframe.

ASH