
[Update: An answer by desertnaut below caught a coding mistake (doubly un-scaling my predictions, leading them to be ~ E+22), but even after correcting this, my problem remains.]


I'm struggling to understand where my mistake is. I'm playing with machine learning to understand how to do regressions better, but I can't successfully do even a simple regression: my results are way off.

I'm taking 3 financial variables, using 2 of them to calculate a new column, then trying to see if my model can predict the column I created. I thought this would be easy because I'm just subtracting one variable from another. I had a low MSE score, which seemed promising, but when I look at my actual results they are way off.

Here are my results (note I'm calculating grossProfit1 as totalRevenue - costOfRevenue. The prediction column is my prediction, which as you can see is way off):

     grossProfit1  totalRevenue Exchange  costOfRevenue    prediction
0    9.839200e+10  2.601740e+11   NASDAQ   1.617820e+11  1.115318e+11
1    9.839200e+10  2.601740e+11   NASDAQ   1.617820e+11  1.115318e+11
2    1.018390e+11  2.655950e+11   NASDAQ   1.637560e+11  1.137465e+11
3    1.018390e+11  2.655950e+11   NASDAQ   1.637560e+11  1.137465e+11
4    8.818600e+10  2.292340e+11   NASDAQ   1.410480e+11  9.953879e+10
..            ...           ...      ...            ...           ...
186  4.224500e+10  9.113400e+10     NYSE   4.888900e+10  4.286892e+10
187  4.078900e+10  9.629300e+10     NYSE   5.550400e+10  4.505001e+10
188  3.748200e+10  8.913100e+10     NYSE   5.164900e+10  4.277003e+10
189  3.397500e+10  8.118600e+10     NYSE   4.721100e+10  4.012077e+10
190  3.597700e+10  8.586600e+10     NYSE   4.988900e+10  4.168953e+10

Here's my simplified code (cleaned-up variable names, more comments, etc.), updated to remove the double inverse scaling per @desertnaut's answer below:

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler

#data
#create grossProfit column, to predict
df['grossProfit1'] = df['totalRevenue'] - df['costOfRevenue']
variableToPredict = ['grossProfit1']
#all columns we are using - grossProfit1 is what we will predict, created simply by subtracting costOfRevenue from totalRevenue. Exchange is just there to see if the network can ignore it.
df2 = df[['grossProfit1', 'totalRevenue', 'Exchange', 'costOfRevenue']]
#I process this data frame, remove duplicates, drop the variable to predict, etc.

#create the datasets for training
X_train = df2.drop(columns=variableToPredict)  #features
y_train = df2[variableToPredict]               #target

#encode any categorical columns
categoryEncoder = OrdinalEncoder()
columnsObjects = list(X_train.select_dtypes(include=['object']).columns)
if len(columnsObjects) != 0:
    X_train[columnsObjects] = categoryEncoder.fit_transform(X_train[columnsObjects])

#scale the data
scaler_X = MinMaxScaler()
scaler_Y = MinMaxScaler()
Xscaled = scaler_X.fit_transform(X_train)
unscaled = scaler_X.inverse_transform(Xscaled)  #round-trip sanity check, unused below
Yscaled = scaler_Y.fit_transform(y_train)

#run simple model for prediction:
model = tf.keras.Sequential() #using tensorflow keras
model.add(layers.Dense(64, activation='relu', input_shape=(Xscaled.shape[1],)))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['MSE'])
model.fit(Xscaled,Yscaled,epochs=10,validation_split=0.2)

# this is the result - seems good with low MSE.
# Epoch 10/10
# 152/152 [==============================] - 0s 174us/sample - loss: 0.0075 - MSE: 0.0075 - val_loss: 0.0076 - val_MSE: 0.0076

#do the predictions on previous data
prediction = model.predict(Xscaled)
# prediction = scaler.inverse_transform(prediction)  # REMOVED after desertnaut's answer below

#join all the data together
new_data_frame=pd.DataFrame(df2) 
new_data_frame['prediction'] = (scaler_Y.inverse_transform(np.asarray(prediction)))
print(new_data_frame)

What am I doing wrong with this regression? With other tutorials it works well (the Boston home prices and miles-per-gallon regression tutorials). I was advised that for regression problems MSE is a better indicator than accuracy (which is better suited to classification), and the MSE looks fine, but I'm not sure why my results are so far off.

halfer
Lostsoul
  • As already argued, updating a question *after* an answer has been offered by adopting the suggestions of that answer (thus making it look irrelevant) is not how SO works and not appreciated. – desertnaut Apr 01 '20 at 20:48
  • @desertnaut I can revert the question but posting a question, having an answer that doesn't resolve the question but marking it complete would ultimately direct people to the wrong results when they search. I thought cleaning the code based on your suggestion would help others but feel free to revert it if you like but I cannot mark your solution as accepted as it didn't resolve the problem. – Lostsoul Apr 01 '20 at 20:51
  • Well, depends how you define "predictions way off". Your initial ones were indeed - E+22 to true targets of E+10 *is* way off, and you had a right to be puzzled (something nowhere to be seen now)! Now, you essentially have the difference between 98 and 111. So, one can argue that your problem ("way off") is essentially resolved, and asking why they are not more *accurate* is another issue... Let me make a prediction: **no one** will come up with an explanation for this, and most people will even consider it off-topic for SO (not a *coding* issue, which SO is actually all about). – desertnaut Apr 01 '20 at 21:00
  • @desertnaut I've double-checked the data; my suspicion is my model is wrong or there's an error in how I'm using the libraries. I'm here looking for guidance. – Lostsoul Apr 01 '20 at 21:07
  • Don't know your data, but there is no error in using the libraries. You did have a serious error in your pipeline, which I resolved; I also offered some additional general guidance and advice. I am here to offer guidance to the extent I can, and the only thing I ask in return is reason and elementary appreciation. It seems I got neither of the two here; this might be expected from SO newbies, but not from users of your standing. Under these circumstances, I regret to say I am not willing to spend a minute more of my time to help you, now or in the future. Good luck, and all the best. – desertnaut Apr 01 '20 at 21:21
  • ... well, I just saw that you also used my `scaler_X` & `scaler_Y` advice in your update! Good thing - future visitors will surely think I am a complete idiot for suggesting stuff already included in your approach!!! – desertnaut Apr 01 '20 at 21:32
  • @desertnaut I def, appreciate your help and am really grateful for it. I just cannot mark your answer as the accepted answer then my issue was not resolved. – Lostsoul Apr 01 '20 at 22:02
  • Your predictions went from "way off" to "not good enough"; your issue was resolved, and you could always open a new question on the different issue of "not good enough", in which I would be happy to contribute. Your "appreciation" and "gratefulness" was to appropriate my answer for updating your question w/o any reference or attribution to my answer and not even upvoting it (I had to ping you in the comments for this). Sorry... – desertnaut Apr 01 '20 at 23:21
  • @desertnaut I'm sorry you feel this way. The reason I didn't upvote was because it wasn't a solution to my problem. As much as you would like to have an accepted answer, It's not fair as you did not solve the issue. I'm not okay with changing the question for your solution and then recreating a question for my original problem. Agains, concerns like this should be raised in meta not here so I won't be able to engage with you further unless it's related to a solution. – Lostsoul Apr 02 '20 at 00:09
  • You *did* change the question - this is my major concern (and the way you did it), not of course the not acceptance! Be my guest with the Meta. And expecting me to engage with you for any solution whatsoever is a joke, of course. – desertnaut Apr 02 '20 at 00:16
  • BTW, here is the correct practice, as it coincidentally happened just now: [first question](https://stackoverflow.com/questions/60988986/loading-keras-model-and-making-prediction-with-it), [follow up question](https://stackoverflow.com/questions/60989899/big-difference-in-accuracy-after-training-the-model-and-after-loading-that-model). – desertnaut Apr 02 '20 at 11:48

1 Answer


You have an issue with unscaling your predictions; you first do it here:

#do the predictions on previous data
prediction = model.predict(Xscaled)
prediction = scaler.inverse_transform(prediction) 

and then you do it again when building your dataframe:

new_data_frame=pd.DataFrame(df2) 
new_data_frame['prediction'] = (scaler.inverse_transform(np.asarray(prediction)))

Not quite sure of the effects, but it is a mistake nevertheless. It is also a good example of why it's not a good idea to reuse the same name for a variable after transforming it (suddenly, you're not sure whether prediction has been inverse-scaled or not).
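To see how badly the double un-scaling can distort things, here is a minimal sketch in plain Python that mimics MinMaxScaler's transform/inverse_transform formulas (the range values are assumptions roughly matching the question's data, not the actual fitted scaler):

```python
# Plain-Python stand-in for a fitted MinMaxScaler, with an assumed
# target range of [0, 2.6e11], similar to the grossProfit1 values.
data_min, data_max = 0.0, 2.6e11
span = data_max - data_min

def scale(x):              # transform: (x - min) / (max - min)
    return (x - data_min) / span

def unscale(x):            # inverse_transform: x * (max - min) + min
    return x * span + data_min

y = 9.84e10                       # a realistic grossProfit1 value
once = unscale(scale(y))          # correct: recovers ~9.84e10
twice = unscale(once)             # the bug: blows up to ~2.6e22
print(f"{once:.3e} {twice:.3e}")
```

Note how the doubly un-scaled value lands around E+22, exactly the magnitude reported before the fix.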

In general, you should use two separate scalers for your features and labels:

#scale the data
scaler_X = MinMaxScaler()
Xscaled = scaler_X.fit_transform(X_train)
scaler_Y = MinMaxScaler()
Yscaled = scaler_Y.fit_transform(y_train)

This doesn't seem to be a problem here, but only coincidentally, due to the order in which you call the relevant commands; it needs attention.
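As a rough illustration of what can go wrong with a single shared scaler, here is a sketch (plain Python standing in for MinMaxScaler on single values; both ranges are made-up numbers in the spirit of the question's data):

```python
# Two 1-D min-max "scalers" fit on different ranges: un-scaling a target
# prediction with the feature scaler lands in the wrong range entirely.
x_min, x_max = 0.0, 2.7e11      # assumed feature range (e.g. totalRevenue)
y_min, y_max = 3.0e10, 1.1e11   # assumed target range (grossProfit1)

def inverse(scaled, lo, hi):    # inverse_transform for a single value
    return scaled * (hi - lo) + lo

pred_scaled = 0.95                          # a model output in [0, 1]
right = inverse(pred_scaled, y_min, y_max)  # y scaler: ~1.06e11, plausible
wrong = inverse(pred_scaled, x_min, x_max)  # X scaler: ~2.57e11, way off
```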

A good point to also keep in mind is that the MSE reported by Keras in cases like this is actually a scaled MSE, i.e. the MSE of the scaled data. To get the "true" MSE for your predictions in the original units, you should do

from sklearn.metrics import mean_squared_error

y_pred_scaled = model.predict(Xscaled)
MSE_true = mean_squared_error(Y, scaler_Y.inverse_transform(y_pred_scaled))

provided that Y here are your initial unscaled targets.
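To get a feel for the magnitudes involved: a min-max scaler with span (max - min) divides every error by span, so the scaled MSE is smaller than the true one by a factor of span squared. A back-of-the-envelope sketch (the span here is an assumption roughly matching the question's data):

```python
# The scaled MSE of ~0.0075 reported by Keras, converted back to the
# original units via an assumed target span of ~2.6e11.
span = 2.6e11
mse_scaled = 0.0075
mse_true = mse_scaled * span ** 2    # MSE in original units, ~5.1e20
rmse_true = mse_true ** 0.5          # typical absolute error, ~2.25e10
print(f"{rmse_true:.3e}")
```

So a seemingly "low" scaled MSE of 0.0075 still corresponds to typical errors of a couple of E+10 in the original units, which is consistent with the discrepancies in the question's results table.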

See my own answer in How to interpret MSE in Keras Regressor for more.

desertnaut
  • Thank you for catching that! I updated the code and tried it again but there wasn't a drastic change in results. I did update the code/results in the question. Thanks for your tip! – Lostsoul Apr 01 '20 at 20:12
  • I'm just not sure what I'm doing wrong. Although I'm doing a prediction, I'm creating the value I'm predicting using a simple formula. Do you think it's my scaling or model? I've tried to play with layers but I'm not really sure what I'm doing and feel there's a gap somewhere in my understanding. – Lostsoul Apr 01 '20 at 20:13
  • @Lostsoul sorry, but predictions going from E+22 to E+11 with true targets being 9.8E+10 (i.e. almost E+11) is a **huge** difference! – desertnaut Apr 01 '20 at 20:15
  • You are right, I apologize. I was thinking more along the lines that it's still exponentially off, I was hoping because the formula was simply Y = A - B, that the accuracy would be significantly closer. – Lostsoul Apr 01 '20 at 20:19
  • @Lostsoul you are a seasoned user, and you probably know that updating questions in such a manner *after* an answer has been provided by adopting the approach suggested in said answer is not how SO works. Please roll back the question to its previous version (now the answer obviously looks ridiculous and plain wrong!), and it's up to you if you'll accept the answer (and possibly open a new question) or if you'll just update the question with the new status (not recommended). – desertnaut Apr 01 '20 at 20:21
  • I can roll back the question, I had only edited it to take your best practice into account and remove the double scaling I did. The problem I came here for still exists so I'm not sure if closing this and then creating the same question again would flag it as a duplicate. I updated the question to give you credit for code clean up, and upvoted your answer. – Lostsoul Apr 01 '20 at 20:25
  • @Lostsoul please be reasonable; follow-up questions with links between them cannot be considered duplicates. – desertnaut Apr 01 '20 at 20:29
  • I'm trying to be reasonable but in principle, the question I asked was not resolved. Your answer pointed out a mistake I made but did not fix the actual problem I'm facing. I feel I'd asked the same question again. I want to be fair, can we raise this in meta? – Lostsoul Apr 01 '20 at 20:36
  • @Lostsoul 1) questions are free 2) questions can be follow-up of previous ones 3) I spent a good time resolving your coding mistake and offering additional advice & good practices, with your first reaction being just to adopt the answer and update the question w/o any mention to my answer. Not appreciated. If you wanna raise this in Meta, be my guest, but I can guess the reactions there (as said, this is not how SO works). – desertnaut Apr 01 '20 at 20:52