[Update: An answer by desertnaut below caught a coding mistake (doubly un-scaling my predictions, which pushed them to around 1e+22), but even after correcting this, my problem remains.]
I'm struggling to understand where my mistake is. I'm playing with machine learning to get better at regression, but I can't get even a simple regression to work: my results are way off.
I'm taking 3 financial variables, using 2 of them to calculate a new column, and then trying to see if my model can predict that new column. I thought this would be easy because the target is just one variable subtracted from another. The MSE during training was low, which seemed promising, but when I look at the actual predictions they are way off.
Here are my results (note: grossProfit1 is calculated as totalRevenue - costOfRevenue; the prediction column is my model's output, which as you can see is way off):
grossProfit1 totalRevenue Exchange costOfRevenue prediction
0 9.839200e+10 2.601740e+11 NASDAQ 1.617820e+11 1.115318e+11
1 9.839200e+10 2.601740e+11 NASDAQ 1.617820e+11 1.115318e+11
2 1.018390e+11 2.655950e+11 NASDAQ 1.637560e+11 1.137465e+11
3 1.018390e+11 2.655950e+11 NASDAQ 1.637560e+11 1.137465e+11
4 8.818600e+10 2.292340e+11 NASDAQ 1.410480e+11 9.953879e+10
.. ... ... ... ... ...
186 4.224500e+10 9.113400e+10 NYSE 4.888900e+10 4.286892e+10
187 4.078900e+10 9.629300e+10 NYSE 5.550400e+10 4.505001e+10
188 3.748200e+10 8.913100e+10 NYSE 5.164900e+10 4.277003e+10
189 3.397500e+10 8.118600e+10 NYSE 4.721100e+10 4.012077e+10
190 3.597700e+10 8.586600e+10 NYSE 4.988900e+10 4.168953e+10
Here's my simplified code (cleaned-up variable names, more comments, etc.), updated to remove the double inverse scaling per @desertnaut's answer below:
#imports
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

#data
#create the grossProfit1 column, which we will try to predict
df['grossProfit1'] = df['totalRevenue'] - df['costOfRevenue']
variableToPredict = ['grossProfit1']
#all columns we are using - grossProfit1 is the target, created simply by subtracting costOfRevenue from totalRevenue. Exchange is just there to see if the network can ignore it.
df2 = df[['grossProfit1','totalRevenue','Exchange', 'costOfRevenue']]
#I also process this data frame here (remove duplicates, etc.)
#create the feature set (X) and the target (y)
X_train = df2.drop(columns=variableToPredict)
y_train = df2[variableToPredict]
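#sanity check (optional): X_train should now hold the 3 feature columns, y_train the single target column
#print(X_train.shape, y_train.shape)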
#encode any categorical (object) columns as numbers
categoryEncoder = OrdinalEncoder()
columnsObjects = list(X_train.select_dtypes(include=['object']).columns)
if len(columnsObjects) != 0:
    X_train[columnsObjects] = categoryEncoder.fit_transform(X_train[columnsObjects])
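#for reference, OrdinalEncoder replaces each category with an integer code (e.g. an Exchange column
#containing only 'NASDAQ' and 'NYSE' would become 0.0 and 1.0); the learned mapping can be inspected with:
#print(categoryEncoder.categories_)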
#scale the data
scaler_X = MinMaxScaler()
scaler_Y = MinMaxScaler()
Xscaled = scaler_X.fit_transform(X_train)
unscaled = scaler_X.inverse_transform(Xscaled)
Yscaled = scaler_Y.fit_transform(y_train)
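#quick check (optional): inverse-transforming the scaled features should recover the originals
#print(np.allclose(unscaled, X_train.to_numpy(dtype=float)))   #expect True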
#run a simple model for prediction:
numInputColumns = Xscaled.shape[1]   #number of feature columns
model = tf.keras.Sequential() #using tensorflow keras
model.add(layers.Dense(64, activation='relu', input_shape=(numInputColumns,)))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['MSE'])
model.fit(Xscaled,Yscaled,epochs=10,validation_split=0.2)
# this is the result - seems good with low MSE.
# Epoch 10/10
# 152/152 [==============================] - 0s 174us/sample - loss: 0.0075 - MSE: 0.0075 - val_loss: 0.0076 - val_MSE: 0.0076
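# To see what a scaled MSE of ~0.0075 means in the original units, the RMSE can be rescaled by the
# target's range (a sketch, using the scaler_Y fitted above):
# y_range = scaler_Y.data_range_[0]
# print(np.sqrt(0.0075) * y_range)   # rough RMSE in dollars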
#do the predictions on the same (scaled) training data
prediction = model.predict(Xscaled)
# prediction = scaler.inverse_transform(prediction) # REMOVED after desertnaut's answer below
#join all the data together
new_data_frame = df2.copy()
new_data_frame['prediction'] = scaler_Y.inverse_transform(prediction)
print(new_data_frame)
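To quantify how far off the predictions are (rather than just eyeballing the table), here is a quick sketch on top of the new_data_frame built above:
errors = new_data_frame['prediction'] - new_data_frame['grossProfit1']
print(errors.abs().mean())   #mean absolute error, in dollars
print((errors.abs() / new_data_frame['grossProfit1'].abs()).mean())   #mean absolute relative error (assuming nonzero gross profit)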
What am I doing wrong with this regression? With other tutorials (Boston home prices, miles-per-gallon regression) this approach works well. I was advised that for regression problems MSE is a better indicator than accuracy (which is for classification), and the MSE looks fine, so I'm not sure why my results are so far off.