How to predict a value with linear regression?

Question

I want to predict the future behavior of my data. My data x and y each contain about 1000 values, and I want to predict the value y[1001]. This is my example.

from numpy.random import randn
from numpy.random import seed
from numpy import sqrt
import numpy
from numpy import sum as arraysum
from scipy.stats import linregress
from matplotlib import pyplot

seed(1)
x = 20 * randn(1000) + 100
print(numpy.size(x))
y = x + (10 * randn(1000) + 50)
print(numpy.size(y))
# fit linear regression model
b1, b0, r_value, p_value, std_err = linregress(x, y)
# make predictions
yhat = b0 + b1 * x
# define new input, expected value and prediction
x_in = x[1001]
y_out = y[1001]
yhat_out = yhat[1001]
# estimate stdev of yhat
sum_errs = arraysum((y - yhat)**2)
stdev = sqrt(1/(len(y)-2) * sum_errs)
# calculate prediction interval
interval = 1.96 * stdev
print('Prediction Interval: %.3f' % interval)
lower, upper = y_out - interval, y_out + interval
print('95%% likelihood that the true value is between %.3f and %.3f' % (lower, upper))
print('True value: %.3f' % yhat_out)
# plot dataset and prediction with interval
pyplot.scatter(x, y)
pyplot.plot(x, yhat, color='red')
pyplot.errorbar(x_in, yhat_out, yerr=interval, color='black', fmt='o')
pyplot.show()

When I try that, it gives me this error.

     x_in = x[1001]
IndexError: index 1001 is out of bounds for axis 0 with size 1000

My goal is to predict the future behavior of my data and to evaluate the prediction by plotting its error bars too. I have seen the example "How do you create a linear regression forecast on time series data in Python", but I don't understand how to apply it to my data. I also found that it is possible to use an ARIMA model. How could I do that?

Look at your `x` definition: if it has 1000 elements, `x[1001]` will throw an `IndexError`. You can define a larger set for `x` (say, 2000 elements), then use the first 1000 to create `y`. Also remember that indices in Python start from 0 (not 1). — Tarifazo, Feb 12 '19 at 18:44
@Mstaino Thank you very much for your answer. However, my goal is to predict the value for an x_in that is not present in my initial vector. That means I need to estimate the future. — dina, Feb 12 '19 at 19:47

score 0 · Answer 1 · answered Feb 12 '19 at 18:48

x = 20 * randn(1000) + 100

^ Here you are creating the input vector x with only 1000 values.

y = x + (10 * randn(1000) + 50)

^ And here you are creating the output vector y, again with only 1000 values.

So when you do x_in = x[1001], you are referring to an element that is not present in the input vector: it contains only 1000 elements, so the valid indices run from 0 to 999.
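The indexing rule behind that error can be checked in isolation. This is only a small illustrative sketch; the array contents and the variable names `last`, `also_last` and `out_of_bounds` are made up for the example.

```python
import numpy as np

x = np.arange(1000)      # 1000 elements, so valid indices are 0..999

last = x[999]            # last element via its explicit index
also_last = x[-1]        # negative indices count from the end

try:
    x[1000]              # one past the end, raises IndexError
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(last, also_last, out_of_bounds)
```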

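A quick fix would be:

from numpy.random import randn, seed
import numpy
from scipy.stats import linregress

seed(1)
# generate 1001 points: the first 1000 for fitting, the last one held out
x = 20 * randn(1001) + 100
print(numpy.size(x))
y = x + (10 * randn(1001) + 50)
print(numpy.size(y))
# fit linear regression model on the first 1000 points only
b1, b0, r_value, p_value, std_err = linregress(x[:1000], y[:1000])
# make predictions for all 1001 inputs
yhat = b0 + b1 * x
# define new input, expected value and prediction (index 1000 is the 1001st element)
x_in = x[1000]
y_out = y[1000]
yhat_out = yhat[1000]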

Thank you very much for your answer. My goal is to predict the value for an x_in that is not present in my initial vector. — dina, Feb 12 '19 at 19:46
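For an input that is genuinely absent from the data, one possible approach is to fit on everything that was observed and then evaluate the fitted line at the new x. The sketch below assumes the same synthetic data as the question; `x_new = 150.0` is an arbitrary, hypothetical input chosen only for illustration. The interval uses the same rough `1.96 * stdev` rule as the question, which ignores the uncertainty in the fitted slope and intercept, so treat it as an approximation. It is centred on the prediction because the true y is unknown for a new input.

```python
from numpy.random import randn, seed
import numpy
from scipy.stats import linregress

seed(1)
x = 20 * randn(1000) + 100
y = x + (10 * randn(1000) + 50)

# fit on all observed data
fit = linregress(x, y)

# hypothetical new input that is not part of x (an assumption for this sketch)
x_new = 150.0
yhat_new = fit.intercept + fit.slope * x_new

# residual standard deviation with n - 2 degrees of freedom
residuals = y - (fit.intercept + fit.slope * x)
stdev = numpy.sqrt(numpy.sum(residuals**2) / (len(y) - 2))

# approximate 95% interval, centred on the prediction
interval = 1.96 * stdev
lower, upper = yhat_new - interval, yhat_new + interval
print('Prediction for x=%.1f: %.3f' % (x_new, yhat_new))
print('Approximate 95%% interval: %.3f to %.3f' % (lower, upper))
```

The point and its interval could then be drawn with `pyplot.errorbar(x_new, yhat_new, yerr=interval, color='black', fmt='o')`, as in the question.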

score 0 · Answer 2 · answered Feb 12 '19 at 21:30

Here is code for a graphing polynomial fitter that fits a first-order polynomial, using numpy.polyfit() to perform the fit and numpy.polyval() to predict values. You can experiment with different polynomial orders by changing the line "polynomialOrder = 1" at the top of the code.

import numpy, matplotlib
import matplotlib.pyplot as plt

xData = numpy.array([1.1, 2.2, 3.3, 4.4, 5.0, 6.6, 7.7, 0.0])
yData = numpy.array([1.1, 20.2, 30.3, 40.4, 50.0, 60.6, 70.7, 0.1])

polynomialOrder = 1 # example straight line

# curve fit the test data
fittedParameters = numpy.polyfit(xData, yData, polynomialOrder)
print('Fitted Parameters:', fittedParameters)

modelPredictions = numpy.polyval(fittedParameters, xData)
absError = modelPredictions - yData

SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print('RMSE:', RMSE)
print('R-squared:', Rsquared)

print()


##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)

    # first the raw data as a scatter plot
    axes.plot(xData, yData,  'D')

    # create data for the fitted equation plot
    xModel = numpy.linspace(min(xData), max(xData))
    yModel = numpy.polyval(fittedParameters, xModel)

    # now the model as a line plot
    axes.plot(xModel, yModel)

    axes.set_xlabel('X Data') # X axis data label
    axes.set_ylabel('Y Data') # Y axis data label

    plt.show()
    plt.close('all') # clean up after using pyplot

graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)

Thank you very much for your help. Could you please explain the role of numpy.polyfit() in estimating future data (data that is not included in my initial vector)? Thanks in advance. — dina, Feb 12 '19 at 21:39
See the line of code with "yModel = numpy.polyval(fittedParameters, xModel)" which does exactly what you ask. — James Phillips, Feb 12 '19 at 23:24
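To make that comment concrete, here is a minimal sketch of evaluating the fitted polynomial at inputs that are not in the original data. It reuses xData and yData from the answer; the `xFuture` values are hypothetical and were picked only because they lie beyond the largest observed x (7.7). Extrapolating a straight line this way assumes the linear trend continues outside the observed range.

```python
import numpy

xData = numpy.array([1.1, 2.2, 3.3, 4.4, 5.0, 6.6, 7.7, 0.0])
yData = numpy.array([1.1, 20.2, 30.3, 40.4, 50.0, 60.6, 70.7, 0.1])

# polyfit returns coefficients, highest order first: [slope, intercept]
fittedParameters = numpy.polyfit(xData, yData, 1)

# hypothetical future inputs beyond max(xData)
xFuture = numpy.array([8.8, 9.9])
yFuture = numpy.polyval(fittedParameters, xFuture)

for xv, yv in zip(xFuture, yFuture):
    print('x = %.1f -> predicted y = %.3f' % (xv, yv))
```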
