
I am exploring linear regression with TensorFlow. Here is my code from this notebook.

import tensorflow as tf
import numpy as np

learning_rate = 0.01

# training data: y = 2x plus Gaussian noise
x_train = np.linspace(-1, 1, 101)
y_train = 2 * x_train + np.random.randn(*x_train.shape) * 0.33

# placeholders for one (x, y) pair fed at each training step
X = tf.placeholder("float")
Y = tf.placeholder("float")

# linear model with no intercept: y_hat = w * x
def model(X, w):
    return tf.multiply(X, w)

w = tf.Variable(0.0, name="weights")

training_epochs = 100
y_model = model(X, w)
# for a single scalar pair this is just the squared error of that pair
cost = tf.reduce_mean(tf.square(Y - y_model))
train_op = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    for epoch in range(training_epochs):
        for (x, y) in zip(x_train, y_train):
            # each call performs one gradient-descent step on this single pair
            sess.run(train_op, feed_dict={X: x, Y: y})
        print(sess.run(w))

It tries to minimize a cost function. According to this question's answers, I think tf.reduce_mean() works like np.mean().
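For example, a quick check (assuming TensorFlow 1.x, as in the code above; the sample values are made up) that tf.reduce_mean matches np.mean, and that for a single scalar it reduces to that value itself:

import numpy as np
import tensorflow as tf

a = np.array([1.0, 2.0, 3.0, 4.0])
with tf.Session() as sess:
    # mean of a tensor matches NumPy's mean of the same array
    print(sess.run(tf.reduce_mean(tf.constant(a))), np.mean(a))   # 2.5 2.5
    # the mean of a single squared difference is just that squared difference
    print(sess.run(tf.reduce_mean(tf.square(3.0 - 2.0))))         # 1.0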

However, every time a pair (x, y) is fed to train_op, the weight w seems to update not according to that single pair but according to all previous pairs.

What is the explanation for that? Is it related to how the optimizer works?

expectedAn
  • What do you mean by `However, every time a pair of (x,y) is fed to the train_op, the weight w seems to update not according to THE pair but to all previous pairs.` – Umang Gupta Jul 13 '18 at 21:41
  • @UmangGupta Hi! The intuition of the code for me is that every time *sess.run(train_op, feed_dict = {X:x, Y: y})* runs, the *w* is updated with respect to that pair of **x,y** or **x_train[i], y_train[i]**, so basically we should get the slope, y/x, for w's value. – expectedAn Jul 13 '18 at 21:47
  • Yes, your understanding is somewhat OK, but there are a few caveats. Slope = dloss/dx, which is different from y/x in general, and it should equal the change in w, not w right away. – Umang Gupta Jul 13 '18 at 21:55
  • According to _model()_, the _w_ that minimizes _square(y-y_model)_ should be _y/x_, right? So just the slope of the line connecting the points (x,y) and (0,0). – expectedAn Jul 13 '18 at 22:02
  • No, your understanding of the optimization process and gradient is completely wrong. – Dr. Snoopy Jul 13 '18 at 22:07
  • Yes, the optimal slope is y/x, but for gradient descent we take a step in the direction of the negative gradient -dloss/dw, which works out to 2(y-y_model)*x. Pardon my previous comment, I meant that delta w follows dloss/dw, not dloss/dx. – Umang Gupta Jul 13 '18 at 22:10
  • If you keep doing GD on that single point you will eventually get a weight equal to y/x, or close to that (see the NumPy sketch after this comment thread). – Umang Gupta Jul 13 '18 at 22:12
  • @MatiasValdenegro I know how gradient descent and optimization work. I don't know how _this piece of code_ works. – expectedAn Jul 13 '18 at 22:19
  • @UmangGupta I wrote an answer myself. Thank you. – expectedAn Jul 13 '18 at 22:47
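To make the comments above concrete, here is a minimal NumPy sketch (not part of the original notebook; the pair (0.5, 1.2) is made up) of a single gradient step on one pair, and of repeating that step until w reaches y/x:

import numpy as np

lr = 0.01
x, y = 0.5, 1.2              # a hypothetical single training pair
w = 0.0
grad = -2 * (y - w * x) * x  # dloss/dw for loss = (y - w*x)**2
print(w - lr * grad)         # one step moves w a little toward y/x, not all the way
for _ in range(2000):        # keep stepping on this one pair
    w -= lr * (-2 * (y - w * x) * x)
print(w, y / x)              # w approaches y/x = 2.4, as the comments note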

2 Answers


I would like to answer my own question. This is not a trivial question if you assume that this code performs exact linear regression.

  1. I misunderstood the behavior of tf.train.GradientDescentOptimizer. Each call runs only one step toward minimizing the loss function; it does not drive the loss to its minimum value. If it did, @UmangGupta is right that we would get the slope y/x.

  2. In each epoch, the optimizer reduces the loss function with respect to each data point "a little bit". Therefore, the order in which you feed the data to the optimizer matters, so the following code gives a different answer.

    for (x, y) in list(zip(x_train, y_train))[::-1]:   # feed the pairs in reverse order
        sess.run(train_op, feed_dict={X: x, Y: y})

In short, this piece of code doesn't run an exact linear regression, but an approximation of it.
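To illustrate both points, here is a NumPy-only sketch (hypothetical; it mirrors the TensorFlow code above rather than reproducing it) of one-gradient-step-per-pair training and of how the feeding order changes the result slightly:

import numpy as np

np.random.seed(0)                                # seed assumed, just for repeatability
x_train = np.linspace(-1, 1, 101)
y_train = 2 * x_train + np.random.randn(*x_train.shape) * 0.33

def sgd_slope(pairs, lr=0.01, epochs=100):
    w = 0.0
    for _ in range(epochs):
        for x, y in pairs:
            w += lr * 2 * (y - w * x) * x        # one gradient step per pair
    return w

pairs = list(zip(x_train, y_train))
print(sgd_slope(pairs), sgd_slope(pairs[::-1]))  # close, but not identical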

expectedAn
  • I think you misunderstood linear regression too. Typically linear regression is not done over batches; it considers the data as a whole and uses some solving technique (pseudo-inverse/GD, whatever). The above code is exactly linear regression with "batched" stochastic gradient descent (see the least-squares sketch after these comments). – Umang Gupta Jul 13 '18 at 22:49
  • Also, if you think of linear regression as fitting a single point and iterating that way over all the points, you should really revisit regression. – Umang Gupta Jul 13 '18 at 22:52
  • @UmangGupta I get your point. There is no "rigorous" one after all. Thanks. – expectedAn Jul 16 '18 at 15:33
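For comparison, a small sketch (same assumed data as above) of the "data as a whole" solution mentioned in the comments, via the pseudo-inverse / least squares:

import numpy as np

np.random.seed(0)
x_train = np.linspace(-1, 1, 101)
y_train = 2 * x_train + np.random.randn(*x_train.shape) * 0.33

X = x_train.reshape(-1, 1)                        # design matrix for y = w * x (no intercept)
w_pinv = np.linalg.pinv(X) @ y_train              # pseudo-inverse solution
w_lstsq, *_ = np.linalg.lstsq(X, y_train, rcond=None)
print(w_pinv, w_lstsq)                            # both give the same least-squares slope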

If you change this piece of your code

for epoch in range(training_epochs):
    for (x,y) in zip(x_train,y_train):
        sess.run(train_op, feed_dict = {X:x, Y: y})

to this

for (x,y) in zip(x_train,y_train):
    for epoch in range(training_epochs):
        sess.run(train_op, feed_dict = {X:x, Y: y})

do you get what you expect?

In your original code, the outer loop runs over iterations (epochs): you fix the first iteration and then apply a gradient-descent step for each pair in turn (so by the time a pair is fed, w already reflects all the previous pairs), then you fix the second iteration and again step through every pair, and so on.

If you interchange your loops as above, then you're fixing a pair and then applying all your iterations of gradient descent to that single pair. I'm not sure if this is what you wanted.
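For reference, a NumPy sketch (hypothetical data, mirroring the swapped loops) of what that ordering computes:

import numpy as np

np.random.seed(0)
x_train = np.linspace(-1, 1, 101)
y_train = 2 * x_train + np.random.randn(*x_train.shape) * 0.33

lr, epochs = 0.01, 100
w = 0.0
for x, y in zip(x_train, y_train):
    for _ in range(epochs):            # many steps on this single pair before moving on
        w += lr * 2 * (y - w * x) * x
print(w, y_train[-1] / x_train[-1])    # the result is dominated by the pairs seen last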

antonioACR1
  • This is as good as running the linear regression on the last data point, which is totally wrong. – Umang Gupta Jul 13 '18 at 22:51
  • @Umang Gupta The person who is asking this question seems to be confused about why the code is updating with respect to all pairs and not a single pair, and I'm suggesting that this is due to the way he is using his loops. I don't see why it has to be good or bad to apply an optimizer to the last data point. – antonioACR1 Jul 13 '18 at 23:03
  • Please tell me where in the question the user is asking about the best predictive model to fit the data. It's only about why it is updating with respect to all pairs instead of a single one. – antonioACR1 Jul 13 '18 at 23:06
  • And it is not updating w.r.t. all pairs in a single iteration, which is what the OP is asking about. Although, fair point, the question is posted in a somewhat confusing way. – Umang Gupta Jul 13 '18 at 23:08
  • It is updating with respect to all pairs in a single operation! These lines are clear: `for epoch in range(training_epochs):` fixes the iteration, then `for (x,y) in zip(x_train,y_train): sess.run(train_op, feed_dict = {X:x, Y: y})` runs the optimizer on all pairs within that fixed iteration! – antonioACR1 Jul 13 '18 at 23:11