
I have code that looks like the following, where I want to minimize a function my_cost with respect to parameters w.

However, when running the code, it appears to be very slow (about 30 times slower) compared to the same thing implemented without TensorFlow (by explicitly defining a function that gives the gradient of the cost).

Am I doing something wrong in the following example code? (maybe I am unnecessarily re-computing the gradients graph each time?)

I am using Python 3 and TensorFlow 2.0.0. Relevant GitHub issue: https://github.com/tensorflow/tensorflow/issues/34144

In the following code, I am using a simple dummy cost function just as an example to show the big difference in the runtime.

Code with TensorFlow:

import numpy as np
import tensorflow as tf
import time

class ExampleTF:
    def __init__(self, n=100, m=10):
        Z = np.random.randn(n, m)
        self.Z = tf.convert_to_tensor(Z, dtype=tf.float32)
        self.w = tf.Variable(np.ones((m, 1)), dtype=tf.float32)

    # =====================================
    def cost(self, P):
        # This is a simple dummy cost function just as an example
        return tf.reduce_sum((self.Z @ self.w) - P)

    # =====================================
    def optimize_w(self, cost_func, parameters, lr=0.01, iterations=2000):
        optimizer = tf.optimizers.Adam(lr)
        for _ in range(iterations):
            optimizer.minimize(cost_func, var_list=parameters)

    # =====================================
    def update(self, P):
        P = tf.convert_to_tensor(P, dtype=tf.float32)

        self.optimize_w(
            cost_func = lambda: self.cost(P),
            parameters = [self.w]
        )

        #print("===> cost:", self.cost(P).numpy())
        #print("w:", self.w.numpy().reshape(-1)[:10])

# =====================================
n, m = 10000, 100
ex_tf = ExampleTF(n, m)
for _ in range(50):
    P = np.random.uniform(size=n).reshape((-1, 1))

    start = time.time()
    ex_tf.update(P)
    elapsed = time.time() - start

    print("elapsed time:", elapsed)

Code without TensorFlow (just NumPy):

import numpy as np
import time

class ExampleNonTF:
    def __init__(self, n=100, m=10):
        self.Z = np.random.randn(n, m)
        self.w = np.ones((m, 1))

    # =====================================
    def cost(self, P):
        # This is a simple dummy cost function just as an example
        return np.sum(self.Z @ self.w - P)

    # =====================================
    def gradient_cost(self, P):
        # This is the gradient of the dummy cost function with respect to self.w
        return np.sum(self.Z, axis=0).reshape(self.w.shape)

    # =====================================
    def optimize_w(self, P, lr=0.01, iterations=2000): # This is the ADAM optimizer
        avg_grad1 = 0; avg_grad2 = 0
        beta1 = 0.9; beta2 = 0.999; eps = 1e-07
        for itr in range(iterations):
            grad = self.gradient_cost(P)
            avg_grad1 = beta1 * avg_grad1 + (1 - beta1) * grad
            avg_grad2 = (beta2 * avg_grad2 + (1 - beta2) * (grad ** 2))
            avg_grad1_corr = avg_grad1 / (1 - beta1 ** (itr + 1))
            avg_grad2_corr = avg_grad2 / (1 - beta2 ** (itr + 1))
            self.w = self.w - lr * (avg_grad1_corr / (np.sqrt(avg_grad2_corr) + eps))

    # =====================================
    def update(self, P):
        self.optimize_w(P)

        #print("===> cost:", self.cost(P))
        #print("w:", self.w.reshape(-1)[:10])

# =====================================
n, m = 10000, 100
ex_nontf = ExampleNonTF(n, m)
for _ in range(50):
    P = np.random.uniform(size=n).reshape((-1, 1))

    start = time.time()
    ex_nontf.update(P)
    elapsed = time.time() - start

    print("elapsed time:", elapsed)
  • Interesting. Can you give a more precise time ratio figure for the `n, m = 3000, 200` case (TF to non-TF)? You can average over 20 iterations (`in range(20)`). Also, how is your non-TF gradient function implemented? (numpy?) – OverLordGoldDragon Nov 10 '19 at 17:48
  • @OverLordGoldDragon Yes, the non-TF gradient function is implemented with numpy, and it is much faster than the code I posted above; I can notice a big difference, in the order of 1 mn over 20 iterations. I think that my function `minimize` is not implemented correctly in the code I posted above. I am calling `optimizer.minimize(cost_func, parameters)` in a loop, which -maybe- causes it to re-compute the graph each time it is called. Should I rather call `minimize_ops = optimizer.minimize(cost_func, parameters)` once and then use `minimize_ops` somehow? Check my function `minimize(...)`. – shn Nov 10 '19 at 18:02
  • 1 mn = 1 _million_? Right, I'll give `minimize` a better look - it is suboptimal, but I wouldn't expect the difference to be this dramatic. – OverLordGoldDragon Nov 10 '19 at 18:06
  • @OverLordGoldDragon the ratio of posted TF to non-TF is about 31. – shn Nov 10 '19 at 18:32
  • Very strange behavior; no data processing overhead culprits, the majority of the time was profiled to be raw cost & gradient computation. Disabling Eager makes things even worse. The only untested factor I can think of is using `tf.optimizers.Adam`; have you tried `tf.GradientTape` instead? For a small-scale operation like this, the raw computations themselves could carry lots of unnecessary overhead. I'd suggest opening a Github issue on this, and seeing a [relevant SO](https://stackoverflow.com/questions/58441514/why-is-tensorflow-2-much-slower-than-tensorflow-1) for reference. – OverLordGoldDragon Nov 10 '19 at 19:51
  • @OverLordGoldDragon I get the same issue (too slow) using `tf.GradientTape` and then `optimizer.apply_gradients` (or maybe I am also doing it wrong; a rough sketch of that variant is included after these comments). I am still not sure if calling `optimizer.minimize(..)` in a loop of 2000 iterations is the right way to do this, or if we need to call it once and use the Operation it returns inside the loop of 2000 iterations. If you think tf v2.0 is the issue, can you provide the (maybe faster) equivalent in tf v1? – shn Nov 10 '19 at 20:17
  • If you don't set up a loop, whatever you use instead will; TF2 alone can't yield a 30x discrepancy, something's off with the code. I'm not too familiar w/ 'manual' computations like these, but I could take another look if you provided your full numpy code – OverLordGoldDragon Nov 10 '19 at 20:23
  • @OverLordGoldDragon I have updated the code with a TF version and a NON-TF version (just numpy). I have used a very simple dummy cost function which has a simple gradient to compute (just to show the performance problem). – shn Nov 10 '19 at 21:24
  • As I suspected, my "small-scale" idea was accurate; run `n, m = int(1e5), int(1e4); range(1); iterations=25`. TF is meant for large-scale computations, which 100x10 is _far_ from; thus, internal pre/post-processing costs dominate. I'm not certain this is the 'entire' story or whether it is satisfactory, but doubt you'll find much more than this - if acceptable, I'll post it as an answer w/ our discussion summary, and maybe some improvement tips. – OverLordGoldDragon Nov 11 '19 at 04:57
  • @OverLordGoldDragon Not sure that's accurate. An ANN implemented with TensorFlow is as fast as a numpy implementation even with a dataset with dimensions `n, m = 3000, 200`; why would our simple dummy cost function be much slower to optimize with TF using the same dimensions? Do we have the same issue with TF v1 (you are welcome to test with v1 if you can)? BTW1 I have now updated the numpy code as it wasn't optimized (non-TF is still faster than TF even with `n, m = 100000, 100`, which I consider big). BTW2 I have opened an issue on GitHub: https://github.com/tensorflow/tensorflow/issues/34144 – shn Nov 11 '19 at 07:45
  • @OverLordGoldDragon Also, I tried the same code with pytorch. For smaller scale `n < 100000` pytorch is faster, for larger scale `n > 100000` tensorflow starts to become better. So you were not totally wrong about your "scale" idea. You can post it as an answer with our discussion summary. – shn Nov 11 '19 at 14:39
  • I was just pondering the "do it via ANN" idea after commenting; will try it myself. This is an interesting question, I'll continue to look into it - no 'answer' yet per the new revelation. – OverLordGoldDragon Nov 11 '19 at 16:56
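
For reference, here is a rough sketch of the tf.GradientTape + apply_gradients variant mentioned in the comments (assumed structure, not the exact code that was benchmarked; optimize_w_tape is a hypothetical standalone name, with Z and P being tensors and w a tf.Variable as in the question):

def optimize_w_tape(Z, w, P, lr=0.01, iterations=2000):
    optimizer = tf.optimizers.Adam(lr)
    for _ in range(iterations):
        with tf.GradientTape() as tape:
            loss = tf.reduce_sum((Z @ w) - P)   # same dummy cost as above
        grads = tape.gradient(loss, [w])
        optimizer.apply_gradients(zip(grads, [w]))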
