
I am new to Machine Learning and trying my hand at Bitcoin price prediction using multiple models: Random Forest, Simple Linear Regression, and a neural network (LSTM).

As far as I have read, Random Forest and Linear Regression don't require input feature scaling, whereas an LSTM does need the input features to be scaled.

If I compare the MAE and RMSE of the models (one evaluated with scaling, the others without), the results will obviously differ, so I can't tell which model actually performs better.

How should I compare the performance of these models now?


Update - Adding my code

Data

import pandas as pd

bitcoinData = pd.DataFrame(
    [['2013-04-01 00:07:00', 93.25, 93.30, 93.30, 93.25, 93.300000],
     ['2013-04-01 00:08:00', 100.00, 100.00, 100.00, 100.00, 93.300000],
     ['2013-04-01 00:09:00', 93.30, 93.30, 93.30, 93.30, 33.676862]],
    columns=['time', 'open', 'close', 'high', 'low', 'volume'])
bitcoinData.time = pd.to_datetime(bitcoinData.time)
bitcoinData = bitcoinData.set_index(['time'])

# train_data / test_data were not defined above; a chronological split is assumed here
train_data, test_data = bitcoinData.iloc[:2], bitcoinData.iloc[2:]

x_train = train_data[['high', 'low', 'open', 'volume']]
y_train = train_data[['close']]
x_test = test_data[['high', 'low', 'open', 'volume']]
y_test = test_data[['close']]

Min-Max Scaler

from sklearn.preprocessing import MinMaxScaler

# separate scalers for the features and the target; fit on the training data only
scaler = MinMaxScaler(feature_range=(0, 1))
scaler1 = MinMaxScaler(feature_range=(0, 1))
x_train = scaler.fit_transform(x_train)
y_train = scaler1.fit_transform(y_train)
x_test = scaler.transform(x_test)
y_test = scaler1.transform(y_test)

Error Metrics (RMSE, MAE, R²)

from math import sqrt
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

print("Root Mean Squared Error (RMSE):", sqrt(mean_squared_error(y_test, preds)))
print("Mean Absolute Error (MAE):", mean_absolute_error(y_test, preds))
r2 = r2_score(y_test, preds)
print("R Squared (R2):", r2)

1 Answer


You should scale only your input data, not the output. The input data is irrelevant to your error calculation, which is computed on the outputs.

If you really want to scale your LSTM output data, just scale it the same way for the other models.

EDIT:

From your comment:

I only scaled my input data in LSTM

No, you don't only scale the input. You also transform your output data, and from what I read, I assume you only transform it for the neural network.

So your y data for the LSTM is around 100 times smaller. With a squared error, that becomes a factor of 100 * 100 = 10,000, which is roughly the factor by which your neural net appears to perform "better" than the random forest.

Option 1:

Remove those three lines:

scaler1 = MinMaxScaler(feature_range=(0, 1))
y_train = scaler1.fit_transform(y_train)
y_test = scaler1.transform(y_test)

Don't forget to use a last layer that can output unbounded values (e.g. a linear activation).
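
A minimal sketch of such an output layer (assuming a Keras-style LSTM; your network code isn't shown, so the layer sizes and timesteps are placeholders):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

timesteps = 1  # placeholder: depends on how you window the time series
model = Sequential([
    LSTM(32, input_shape=(timesteps, 4)),  # 4 features: high, low, open, volume
    Dense(1, activation='linear')          # linear activation -> unbounded output, suitable for unscaled prices
])
model.compile(optimizer='adam', loss='mse')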

Option 2:

Scale the target data for your other models as well and compare the scaled errors.
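
A minimal sketch of that, reusing the x_train/y_train already transformed by scaler/scaler1 in your question and assuming a RandomForestRegressor:

from math import sqrt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Train the random forest on the same scaled targets as the LSTM,
# so the errors of both models are computed on the same (scaled) scale.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(x_train, y_train.ravel())   # y_train here is the output of scaler1
rf_preds = rf.predict(x_test)
print("RF RMSE (scaled):", sqrt(mean_squared_error(y_test, rf_preds)))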

Option 3:

Use the inverse_transform() method of your MinMaxScaler() on your predictions, and calculate your errors between the inverse-transformed predictions and the untransformed y_test data.
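
A minimal sketch, assuming preds holds the scaled LSTM predictions and scaler1 is the target scaler from your question (note that y_test was overwritten with its scaled version above, so it is inverse-transformed here as well):

import numpy as np
from math import sqrt
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Map the scaled predictions and targets back to the original price scale,
# then compute the errors there so they are comparable across all models.
preds_orig = scaler1.inverse_transform(np.asarray(preds).reshape(-1, 1))
y_test_orig = scaler1.inverse_transform(y_test)
print("RMSE (original scale):", sqrt(mean_squared_error(y_test_orig, preds_orig)))
print("MAE (original scale):", mean_absolute_error(y_test_orig, preds_orig))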

  • I only scaled my input data in LSTM (using the scikit-learn Min-Max scaler). When I apply fit() and calculate the y_pred value, I use sqrt(mean_squared_error(y_test, preds)) to calculate the RMSE. The LSTM gives a value of 0.000959 whereas the Random Forest (without any input data scaling) gives an RMSE of 3.7267. – Divya Kaushik Sep 16 '19 at 11:30
  • Then I'd say your LSTM is about 10,000 times better than your random forest, until you post the code of your models, how you calculate the RMSE, and a few x/y examples of your data. – Florian H Sep 16 '19 at 12:01
  • I edited my answer, am I right with the assumption that you only scale the data for the neural network? – Florian H Sep 16 '19 at 13:30
  • Thanks Florian. I didn't see the scaling I was doing on the output data. – Divya Kaushik Sep 16 '19 at 14:48