
Hey guys, I have an issue with my exam project. I am trying to create a very simple stock predictor, using a web API called IEXTrading that returns the last 5 years of Tesla stock data in JSON format, nothing fancy. I then want to be able to predict the stock for tomorrow (the next day). However, I must admit that I am feeling very lost doing machine learning. I think I have managed to create the model, but it always reports 100% accuracy, which I know shouldn't be true/possible. To be honest, I don't even know where to look for the problem; I am guessing it must be related to the test/train data. And I guess once this is done, I need to find out how to give the model only tomorrow's date as input for prediction.

Here is my code, thanks a lot in advance:

import matplotlib 
import matplotlib.pyplot as plt 
import numpy as np 
from sklearn import linear_model 
import sklearn.metrics as sm
import pandas as pd 

data = pd.read_json('https://api.iextrading.com/1.0/stock/tsla/chart/5y')
data.head()

data = data.iloc[:, :]

from sklearn import preprocessing
enc = preprocessing.LabelEncoder()
enc.fit(data['date'])
data['date'] = enc.transform(data['date'])

# label is a date string, e.g. "Dec 13, 13"
enc2 = preprocessing.LabelEncoder()
enc2.fit(data['label'])
data['label'] = enc2.transform(data['label'])

X = data.iloc[:, :-1].values 
X = data.drop('close', axis=1)
y = data.iloc[:, 3] 

# Split in train and test
num_training = int(0.8 * len(X))
num_test = len(X) - num_training

# Training data
X_train, y_train = X[:num_training], y[:num_training]

# Test data
X_test, y_test = X[num_training:], y[num_training:]

# Create linear regressor object
regressor = linear_model.LinearRegression()

# Train the model using the training sets
regressor.fit(X_train, y_train)

# Predict the output
y_test_pred = regressor.predict(X_test)

# Compute performance metrics
print("Linear regressor performance:")
print("Mean absolute error =", round(sm.mean_absolute_error(y_test, y_test_pred), 2))
print("Mean squared error =", round(sm.mean_squared_error(y_test, y_test_pred), 2)) 
print("Median absolute error =", round(sm.median_absolute_error(y_test, y_test_pred), 2)) 
print("Explain variance score =", round(sm.explained_variance_score(y_test, y_test_pred), 2))
print("R2 score =", round(sm.r2_score(y_test, y_test_pred), 2))

# Perform prediction on the training data as well, for comparison
y_train_pred = regressor.predict(X_train)
print("\nTraining R2 score =", round(sm.r2_score(y_train, y_train_pred), 2))

Here is an example of the data

Data columns (total 12 columns):
change              1258 non-null float64
changeOverTime      1258 non-null float64
changePercent       1258 non-null float64
close               1258 non-null float64
date                1258 non-null datetime64[ns]
high                1258 non-null float64
label               1258 non-null object
low                 1258 non-null float64
open                1258 non-null float64
unadjustedVolume    1258 non-null int64
volume              1258 non-null int64
vwap                1258 non-null float64
dtypes: datetime64[ns](1), float64(8), int64(2), object(1)

#Example Values from data entry: 0
change : 0.184
changeOverTime: 0.000000
changePercent: 0.125
close: 147.654
date: 2013-12-13
high: 151.80
label: Dec 13, 13
low: 147.3200
open: 148.05
unadjustedVolume: 10599775
volume: 10599775
vwap: 149.5224
  • It sounds like you likely have [data leakage](https://www.kaggle.com/dansbecker/data-leakage) between your training data and your targets. Can you provide a few sample rows of your data? – G. Anderson Dec 13 '18 at 18:39
  • Why using a `LabelEncoder` for your `label` in a *regression* problem? Aren't your "labels" continuous numerical values? – desertnaut Dec 13 '18 at 18:43
  • @G.Anderson I have included in the original post an example of the data. Hope this helps – ThePrograminator Dec 13 '18 at 18:59
  • @desertnaut label in the context of the data is like a date, I have provided an example in the post. However, I had to use LabelEncoder, else the model wouldn't fit the columns label and date because they were a string or datetime. Hope this clarifies – ThePrograminator Dec 13 '18 at 19:00
  • So, what's the output you get? – Blorgbeard Dec 13 '18 at 19:09
  • Regarding the date/labels. at the very least I would remove one or the other, because from your sample they are the same thing so having both is redundant. Further, do you think the date of the year is going to be a predictor of stock close price? – G. Anderson Dec 13 '18 at 19:12
  • 1
    In your question you say "Accuracy 100%", but you're using MSE, What are your actual values for `"Linear regressor performance:"`? – G. Anderson Dec 13 '18 at 19:14
  • @Blorgbeard, well the output I get is 100% prediction; the predicted values are 100% (99.99%) the same as the training/test data – ThePrograminator Dec 13 '18 at 19:14
  • @G.Anderson Removing the label could be an idea, yes. Well, I know it probably is a terrible predictor, but it doesn't have to be "real world accurate". I just need something that kinda works – ThePrograminator Dec 13 '18 at 19:16
  • @G.Anderson To your second question, the actual values for the linear regressor performance are: Mean absolute error = 0.0, Mean squared error = 0.0, Median absolute error = 0.0, Explain variance score = 0.0, R2 score = 1.0 – ThePrograminator Dec 13 '18 at 19:17
  • 2
    There are multiple problems: (1) The `changeOverTime` feature is a perfect predictor of the output `close`. It is a cheating feature which you should remove. (2) The two lines `X = data.iloc[:, :-1].values` and `X = data.drop('close', axis=1)` are reversed. You want `X` to be a matrix, not a dataframe. The second line overwrites the `X` from the first line. – stackoverflowuser2010 Dec 13 '18 at 19:21
  • @stackoverflowuser2010 I will remove `changeOverTime` then, thanks. In regards to point 2, can you elaborate on the "reversed" part? Also, should X be a `numpy array`, and is that the same as a `matrix`? – ThePrograminator Dec 13 '18 at 19:26
  • In general: (1) use `df.head()` aggressively at the start of any experiment to inspect the data to make sure there is nothing fishy among the variables. (2) In linear regression, if you get perfect accuracy (i.e. zero error), then that means that there is a perfect correlation between one or more of the features and the output variable. You can find these correlations using Pearson correlation or `corrplot()` at a glance. (3) In general, I don't like using `df.drop()` to remove features. I would prefer to select features so that I know exactly what features are in the dataframe. – stackoverflowuser2010 Dec 13 '18 at 19:27
  • Try: `y = data[['close']]`, then `temp = data[['feature1', 'feature2', 'feature3']]`, then `X = temp.values`. – stackoverflowuser2010 Dec 13 '18 at 19:31
  • @stackoverflowuser2010 These all sound like great tips, thanks! Is the `Pearson correlation` related to the `Seaborn heatmap`? – ThePrograminator Dec 13 '18 at 19:33
  • See https://stackoverflow.com/questions/11285613/selecting-multiple-columns-in-a-pandas-dataframe for help on selecting columns by name. – stackoverflowuser2010 Dec 13 '18 at 19:33
  • 1
    You also should make sure that you're using features from one day and the predicted `close` from the next day. That is, you shouldn't use features and the `close` from the same day. – stackoverflowuser2010 Dec 13 '18 at 21:57
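Putting the suggestions from the comments above together, here is a minimal sketch of the by-name feature selection and the Pearson correlation check that stackoverflowuser2010 describes. It assumes the column names shown in the data description in the question; the particular feature list is only an illustration, not the "right" set of predictors.

import pandas as pd

# Same endpoint as in the question
data = pd.read_json('https://api.iextrading.com/1.0/stock/tsla/chart/5y')

# Inspect the raw frame before anything else
print(data.head())

# Pearson correlation of every numeric column with the target. A value at or
# very near 1.0 (changeOverTime here) flags a column that leaks the answer.
numeric = data.select_dtypes(include='number')
print(numeric.corr()['close'].sort_values())

# Select features by name instead of using drop(), leaving out 'close' and the
# leaking 'changeOverTime' column
feature_cols = ['open', 'high', 'low', 'volume', 'vwap']
X = data[feature_cols].values   # plain NumPy matrix, one row per trading day
y = data['close'].values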
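And a sketch of the last suggestion (use one day's features to predict the next day's `close`): shift the target column back by one row, keep the chronological 80/20 split from the question, and use the most recent row for the actual next-day forecast. It continues from the `data` frame loaded above; `close_next` and the feature list are illustrative choices.

from sklearn.linear_model import LinearRegression

# Build a "tomorrow" target by shifting close back one row
frame = data.sort_values('date').copy()
frame['close_next'] = frame['close'].shift(-1)

feature_cols = ['open', 'high', 'low', 'close', 'volume']

# Only rows with a known "tomorrow" can be used for fitting and evaluation
known = frame.dropna(subset=['close_next'])
X = known[feature_cols].values
y = known['close_next'].values

# Keep the time order: fit on the earlier 80%, test on the most recent 20%
split = int(0.8 * len(X))
model = LinearRegression().fit(X[:split], y[:split])
print("Test R2:", model.score(X[split:], y[split:]))

# The most recent day has no known tomorrow; its features give the forecast
latest = frame[feature_cols].iloc[[-1]].values
print("Predicted close for the next day:", model.predict(latest)[0])

With the leaking columns removed and the target shifted by a day, the R2 score should drop from 1.0 to something believable.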

0 Answers