
I am trying to write an MLP with TensorFlow (which I just started learning, so apologies for the code!) for multivariate REGRESSION (no MNIST, please). Here is my MWE, where I chose to use the linnerud dataset from sklearn. (In reality I am using a much larger dataset, and I am only using one hidden layer here to keep the MWE small, but I can add more if necessary.) By the way, I am using shuffle = False in train_test_split just because in reality I am working with a time series dataset.

MWE

######################### import stuff ##########################
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.datasets import load_linnerud
from sklearn.model_selection import train_test_split
   

######################## prepare the data ########################
X, y = load_linnerud(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle = False, test_size = 0.33)


######################## set learning variables ##################
learning_rate = 0.0001
epochs = 100
batch_size = 3


######################## set some variables #######################
x = tf.placeholder(tf.float32, [None, 3], name = 'x')   # 3 features
y = tf.placeholder(tf.float32, [None, 3], name = 'y')   # 3 outputs

# input-to-hidden layer1
W1 = tf.Variable(tf.truncated_normal([3,300], stddev = 0.03), name = 'W1')
b1 = tf.Variable(tf.truncated_normal([300]), name = 'b1')  

# hidden layer1-to-output
W2 = tf.Variable(tf.truncated_normal([300,3], stddev = 0.03), name=  'W2')    
b2 = tf.Variable(tf.truncated_normal([3]), name = 'b2')   


######################## Activations, outputs ######################
# output hidden layer 1
hidden_out = tf.nn.relu(tf.add(tf.matmul(x, W1), b1))   

# total output
y_ = tf.nn.relu(tf.add(tf.matmul(hidden_out, W2), b2)) 


####################### Loss Function  #########################
mse = tf.losses.mean_squared_error(y, y_)


####################### Optimizer      #########################
optimizer = tf.train.GradientDescentOptimizer(learning_rate = learning_rate).minimize(mse)  


###################### Initialize, Accuracy and Run #################
# initialize variables
init_op = tf.global_variables_initializer()

# accuracy for the test set
accuracy = tf.reduce_mean(tf.square(tf.subtract(y, y_))) # or could use tf.losses.mean_squared_error

#run
with tf.Session() as sess:
    sess.run(init_op)
    total_batch = int(len(y_train) / batch_size)
    for epoch in range(epochs):
        avg_cost = 0
        for i in range(total_batch):
            batch_x = X_train[i * batch_size:min(i * batch_size + batch_size, len(X_train)), :]
            batch_y = y_train[i * batch_size:min(i * batch_size + batch_size, len(y_train)), :]
            _, c = sess.run([optimizer, mse], feed_dict = {x: batch_x, y: batch_y})
            avg_cost += c / total_batch
        print('Epoch:', (epoch + 1), 'cost =', '{:.3f}'.format(avg_cost))
    print(sess.run(mse, feed_dict = {x: X_test, y: y_test}))

This prints out something like this:

...
Epoch: 98 cost = 10992.617
Epoch: 99 cost = 10992.592
Epoch: 100 cost = 10992.566
11815.1

So obviously there is something wrong. I suspect that the problem is either in the cost function/accuracy or in the way I am using batches, but I can't quite figure it out.

Euler_Salter
  • maybe one of the problems is that I am not using regularization? – Euler_Salter Oct 19 '17 at 14:50
  • I've tried doing something like `regularizer1 = tf.nn.l2_loss(W1)` and `regularizer2 = tf.nn.l2_loss(W2)` and then adding them to the loss function, `mse = tf.losses.mean_squared_error(y, y_) + 0.001 * regularizer1 + 0.001 * regularizer2`, but it only got worse. – Euler_Salter Oct 19 '17 at 14:51
  • One word of advice: it's not recommended to use a relu activation on the output layer. Instead use a linear activation for regression, sigmoid for 2-class classification and softmax for multi-class classification. – BigBadMe May 05 '18 at 12:09
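
For reference, the linear output suggested in the last comment would look roughly like this in the graph above (a minimal sketch reusing x, W1, b1, W2 and b2 from the MWE):

# keep the ReLU on the hidden layer, but use a linear (identity) output for regression
hidden_out = tf.nn.relu(tf.add(tf.matmul(x, W1), b1))
y_ = tf.add(tf.matmul(hidden_out, W2), b2)   # no tf.nn.relu on the output layer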

1 Answer


As far as I can see, the model is learning. I tried to tune some of the hyperparameters (most significantly, the learning rate and the hidden layer size) and got much better results. Here's the full code:

######################### import stuff ##########################
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.datasets import load_linnerud
from sklearn.model_selection import train_test_split

######################## prepare the data ########################
X, y = load_linnerud(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=False)

######################## set learning variables ##################
learning_rate = 0.0005
epochs = 2000
batch_size = 3

######################## set some variables #######################
x = tf.placeholder(tf.float32, [None, 3], name='x')  # 3 features
y = tf.placeholder(tf.float32, [None, 3], name='y')  # 3 outputs

# hidden layer 1
W1 = tf.Variable(tf.truncated_normal([3, 10], stddev=0.03), name='W1')
b1 = tf.Variable(tf.truncated_normal([10]), name='b1')

# hidden layer 1-to-output
W2 = tf.Variable(tf.truncated_normal([10, 3], stddev=0.03), name='W2')
b2 = tf.Variable(tf.truncated_normal([3]), name='b2')

######################## Activations, outputs ######################
# output hidden layer 1
hidden_out = tf.nn.relu(tf.add(tf.matmul(x, W1), b1))

# total output
y_ = tf.nn.relu(tf.add(tf.matmul(hidden_out, W2), b2))

####################### Loss Function  #########################
mse = tf.losses.mean_squared_error(y, y_)

####################### Optimizer      #########################
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(mse)

###################### Initialize, Accuracy and Run #################
# initialize variables
init_op = tf.global_variables_initializer()

# accuracy for the test set
accuracy = tf.reduce_mean(tf.square(tf.subtract(y, y_)))  # or could use tf.losses.mean_squared_error

# run
with tf.Session() as sess:
  sess.run(init_op)
  total_batch = int(len(y_train) / batch_size)
  for epoch in range(epochs):
    avg_cost = 0
    for i in range(total_batch):
      batch_x, batch_y = X_train[i * batch_size:min(i * batch_size + batch_size, len(X_train)), :], \
                         y_train[i * batch_size:min(i * batch_size + batch_size, len(y_train)), :]
      _, c = sess.run([optimizer, mse], feed_dict={x: batch_x, y: batch_y})
      avg_cost += c / total_batch
    if epoch % 10 == 0:
      print('Epoch:', (epoch + 1), 'cost =', '{:.3f}'.format(avg_cost))
  print(sess.run(mse, feed_dict={x: X_test, y: y_test}))

Output:

Epoch: 1901 cost = 173.914
Epoch: 1911 cost = 171.928
Epoch: 1921 cost = 169.993
Epoch: 1931 cost = 168.110
Epoch: 1941 cost = 166.277
Epoch: 1951 cost = 164.492
Epoch: 1961 cost = 162.753
Epoch: 1971 cost = 161.061
Epoch: 1981 cost = 159.413
Epoch: 1991 cost = 157.808
482.433

I think you could tune it even further, but it doesn't make much sense since the data is so small. I didn't experiment with regularization, though; I'm sure you'll need L2 regularization or dropout to avoid overfitting.
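
For example, regularization could be added to the graph above along these lines (a minimal sketch: keep_prob is a new placeholder and the 0.001 penalty weight is an arbitrary choice, not a tuned value):

# dropout on the hidden layer: feed keep_prob=0.5 (say) while training
# and keep_prob=1.0 when evaluating on the test set
keep_prob = tf.placeholder(tf.float32, name='keep_prob')
hidden_out = tf.nn.dropout(tf.nn.relu(tf.add(tf.matmul(x, W1), b1)), keep_prob)

# L2 regularization: add a small weight penalty on top of the MSE loss
l2_reg = 0.001 * (tf.nn.l2_loss(W1) + tf.nn.l2_loss(W2))
mse = tf.losses.mean_squared_error(y, y_) + l2_reg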

Maxim
  • Thank you! Would you have any other advice? Also, do you think it would be worth using cosine similarity instead of dot product? For instance (I am not even sure this is the correct way of doing it in tensorflow) `hidden_out = tf.nn.relu(tf.add(tf.divide(tf.matmul(x, W1), tf.multiply(tf.norm(x), tf.norm(W1))), b1))` and `y_ = tf.nn.relu(tf.add(tf.divide(tf.matmul(hidden_out, W2), tf.multiply(tf.norm(hidden_out), tf.norm(W2))), b2))` – Euler_Salter Oct 19 '17 at 15:22
  • @Euler_Salter It's hard to give specific tips without looking at the actual time series data (I believe your goal is not the Linnerud dataset). In general, I'd consider adding a batchnorm layer and/or dropout, which help against overfitting and also tend to make the network learn faster (a minimal sketch is included after these comments). Consider other activation functions: ELU, SELU. When you hit a hard limit with 1 hidden layer, maybe it's time to move on to a deep network. But each model needs careful examination before making decisions: how the gradients are flowing, what the distribution of activations is, etc. – Maxim Oct 19 '17 at 15:33
  • Yes, I would like to predict a multivariate time series with exogenous data. I found this tutorial (https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/) and others made by the same guy, but nowhere could I find a good and simple implementation of a regression MLP with TensorFlow rather than Keras. I would like to have more flexibility than what Keras allows, although I know both Keras and TensorFlow quite badly; I only have a few days of trial and error – Euler_Salter Oct 19 '17 at 15:40
  • The point is that I can't give you the data, and also, when I use it on the data, the training error flattens out at a very high value and never goes below that threshold. I will definitely try dropout and similar things though! For now, do you think the NN I provided is fully functional, or does it have minor bugs? – Euler_Salter Oct 19 '17 at 15:41
  • What I can say for sure is that a simple NN without a built-in normalizing mechanism (batch norm, weight norm, SELU, etc.) is *very* sensitive to hyperparameters. It happens so often that one or two changes in the init stddev, the learning rate or some other param make the NN learn much better. This question is one of those cases, so I wouldn't give up on NNs so quickly – Maxim Oct 19 '17 at 15:48
  • Speaking of hyperparameter optimization, there's an approach I think you should take a look at: https://stackoverflow.com/q/41860817/712995 – Maxim Oct 19 '17 at 15:50
  • Thank you for all this information! I've already accepted your answer. However, if you feel like it and have the time, would you be able to implement some of those mechanisms in the TensorFlow code? I know very little about NNs in general, so I am not sure I know what you mean by built-in normalizing mechanisms. If you don't have the time, don't worry! – Euler_Salter Oct 19 '17 at 15:58
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/157080/discussion-between-maxim-and-euler-salter). – Maxim Oct 19 '17 at 16:00
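
To make the suggestion about built-in normalizing mechanisms more concrete, here is a minimal sketch (not part of the original answer) of a hidden layer with batch normalization and an ELU activation in the same TF1 style; is_training is a new placeholder, and the update_ops dependency is needed so that batch norm's moving averages are updated during training:

# hidden layer with batch normalization and ELU instead of plain ReLU;
# feed is_training=True while training and is_training=False at test time
is_training = tf.placeholder(tf.bool, name='is_training')
z1 = tf.add(tf.matmul(x, W1), b1)
hidden_out = tf.nn.elu(tf.layers.batch_normalization(z1, training=is_training))

# batch norm keeps moving averages that must be updated alongside the optimizer step
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(mse)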