
I am computing the mean and standard deviation in numpy. To increase performance, I tried the same in TensorFlow, but TensorFlow was at least ~10x slower. I tried two approaches in TensorFlow (code below). The first approach uses tf.nn.moments(), which has a bug that causes it to sometimes return a negative value for variance. In the second approach I calculate the variance via other TensorFlow functions.

I tried CPU-only and GPU; numpy is always faster.

I used time.time() rather than time.clock() in order to measure wall-clock time when using GPU.

Why is TensorFlow slower? I thought it might be due to transferring data into the GPU, but TF is slower even for very small datasets (where transfer time should be negligible), and when using the CPU only. Is this due to the overhead time required to initialize TF? (A sketch that separates setup time from compute time follows the code below.)

import tensorflow as tf
import numpy
import time
import math

class Timer:
    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, *args):
        self.end = time.time()
        self.interval = self.end - self.start

inData = numpy.random.uniform(low=-1, high=1, size=(40000000,))

with Timer() as t:
    mean = numpy.mean(inData)
print('python mean', mean, 'time', t.interval)

with Timer() as t:
    stdev = numpy.std(inData)
print('python stdev', stdev, 'time', t.interval)

# Approach 1 (Note tf.nn.moments() has a bug)
with Timer() as t:
    with tf.Graph().as_default():
        meanTF, varianceTF = tf.nn.moments(tf.constant(inData), axes=[0])
        init_op = tf.global_variables_initializer()
        with tf.Session() as sess:
            sess.run(init_op)
            mean, variance = sess.run([meanTF, varianceTF])
print('variance', variance)
stdev = math.sqrt(variance)
print('tensorflow mean', mean, 'stdev', stdev, 'time', t.interval)

# Approach 2
with Timer() as t:
    with tf.Graph().as_default():
        inputVector = tf.constant(inData)
        meanTF = tf.reduce_mean(inputVector)
        length = tf.size(inputVector)
        varianceTF = tf.divide(tf.reduce_sum(tf.squared_difference(inputVector, meanTF)), tf.to_double(length))  # use the meanTF tensor, not the numpy mean computed earlier
        init_op = tf.global_variables_initializer()
        with tf.Session() as sess:
            sess.run(init_op)
            mean, variance = sess.run([meanTF, varianceTF])
print('variance', variance)
stdev = math.sqrt(variance)
print('tensorflow mean', mean, 'stdev', stdev, 'time', t.interval)
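
To test the initialization-overhead hypothesis directly, one can time graph construction, session creation, and repeated runs separately. Below is a minimal sketch (assuming the same TF 1.x API and inData shape as above); the first sess.run() typically costs far more than later ones:

import time
import numpy
import tensorflow as tf

inData = numpy.random.uniform(low=-1, high=1, size=(40000000,))

t0 = time.time()
graph = tf.Graph()
with graph.as_default():
    meanTF = tf.reduce_mean(tf.constant(inData))
print('graph build time', time.time() - t0)

t0 = time.time()
sess = tf.Session(graph=graph)  # on GPU builds this includes CUDA initialization
print('session startup time', time.time() - t0)

for i in range(3):
    t0 = time.time()
    sess.run(meanTF)  # the first run can include one-time setup; later runs show steady-state cost
    print('run', i, 'time', time.time() - t0)
sess.close()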
Ron Cohen
  • "I thought it might be due to transferring data into the GPU, but TF is slower even for very small datasets" – it looks like you have something swapped there. I would say your kind of computation is simple, so numpy reaches the hardware limit very well thanks to specialized functions and [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) (which might run in parallel depending on your BLAS setup, e.g. the default on Ubuntu). TensorFlow can't do much magic to do better (while guaranteeing the same accuracy). – sascha Mar 09 '17 at 18:17
  • Tensorflow is consistently much slower than Numpy in my tests. Shouldn't Tensorflow be much faster since it uses GPU and Numpy uses only CPU? I am running Ubuntu and have not changed anything to affect BLAS (that I am aware of). – Ron Cohen Mar 09 '17 at 18:34
  • This always depends on the task. Some algorithms parallelize nicely, some do not (and you already mentioned other factors like transfer; there are also dtypes and so on). Not everything is a good job for the GPU. – sascha Mar 09 '17 at 18:35
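
As a quick check of the BLAS point above, numpy can report which BLAS/LAPACK libraries it is linked against:

import numpy
numpy.show_config()  # prints the BLAS/LAPACK build configuration numpy was compiled with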

3 Answers


Below is a slightly better benchmark. Tested on a Xeon v3 with CPU-only TensorFlow compiled with all optimization options + XLA, against the MKL numpy that comes with the latest Anaconda.

XLA probably didn't make a difference here, but I left it in for posterity.

Notes:

  1. Exclude the first couple of runs from timing; they can include initialization/profiling work.

  2. Use variables to avoid copying the input into the TensorFlow runtime on every run.

  3. Perturb the variable between calls to make sure there's no caching of results.

Result:

   numpy 23.5 ms, 25.7 ms
      tf 14.7 ms, 20.5 ms

Code:

import numpy as np
import tensorflow as tf
import time
from tensorflow.contrib.compiler import jit
jit_scope = jit.experimental_jit_scope

inData = np.random.uniform(low=-1, high=1, size=(40000000,)).astype(np.float32)
#inDataFeed = tf.placeholder(inData.dtype)

with jit_scope(compile_ops=True):
    inDataVar = tf.Variable(inData)
    meanTF = tf.reduce_mean(inDataVar)


sess = tf.Session()
sess.run(tf.global_variables_initializer())
num_tries = 10


times = []
for i in range(num_tries):
    t0 = time.perf_counter()
    mean = np.mean(inData)
    times.append(time.perf_counter()-t0)

print("%10s %.1f ms, %.1f ms" %("numpy", 10**3*min(times),
                                10**3*np.median(times)))

times = []
perturb = inDataVar.assign_add(tf.random_uniform(inData.shape))
for i in range(num_tries):
    sess.run(perturb)
    t0 = time.perf_counter()
    mean, = sess.run([meanTF])
    times.append(time.perf_counter()-t0)

times = times[2:] # discard first few because they could include profiling runs
print("%10s %.1f ms, %.1f ms" %("tf", 10**3*min(times),
                                10**3*np.median(times)))
Yaroslav Bulatov
  • Thanks Yaroslav. My original goal was to get a performance boost via the GPU. Do you know if there is substantial overhead time involved in loading data into the GPU and starting the GPU session? – Ron Cohen Mar 10 '17 at 19:26
  • Yes, starting a GPU session is substantial overhead; it can take >30 seconds in some unlucky cases (when you use a video card whose compute capability the binary has not been compiled for, as is the case for the GTX 1080). – Yaroslav Bulatov Mar 10 '17 at 19:49
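
A rough way to see that overhead is to time session creation and the first run separately; a sketch, assuming a TF 1.x GPU build (log_device_placement just confirms that ops land on the GPU):

import time
import tensorflow as tf

t0 = time.time()
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print('session startup: %.1f s' % (time.time() - t0))

x = tf.random_uniform((1000, 1000))
y = tf.matmul(x, x)

t0 = time.time()
sess.run(y)  # the first run may include PTX JIT compilation on unsupported compute capabilities
print('first run: %.2f s' % (time.time() - t0))

t0 = time.time()
sess.run(y)
print('second run: %.4f s' % (time.time() - t0))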

Here is a benchmark from someone who claims that TF's mean is significantly faster than in numpy or Theano. The benchmark is here and was run on

an Intel Core i5-4460 CPU with 16 GiB RAM and an Nvidia GTX 970 with 4 GiB RAM using Theano 0.8.2, TensorFlow 0.11.0, CUDA 8.0 on Linux Mint 18

[Benchmark chart from the linked post: time to compute the mean in TensorFlow vs. numpy vs. Theano]

Here are some other benchmarks, but they do not address mean.

Salvador Dali

Please find another benchmark and explanation at https://towardsdatascience.com/numpy-vs-tensorflow-speed-on-matrix-calculations-9cbff6b3ce04
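
The linked article benchmarks matrix operations rather than reductions; its gist can be reproduced with a sketch along these lines (TF 1.x, matrix size chosen arbitrarily):

import time
import numpy as np
import tensorflow as tf

a = np.random.rand(2000, 2000).astype(np.float32)

t0 = time.time()
np.dot(a, a)
print('numpy matmul: %.3f s' % (time.time() - t0))

aVar = tf.Variable(a)  # a variable avoids re-copying the input on each run
prod = tf.matmul(aVar, aVar)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(prod)  # warm-up run, excluded from timing
t0 = time.time()
sess.run(prod)
print('tensorflow matmul: %.3f s' % (time.time() - t0))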

fviktor