
This is a question based on (or a follow-up to) another question: Faster implementation of ReLU derivative.

In the spirit of coming up with the fastest way of computing the derivative, I wrote several solutions, one of which is:

In [33]: import numpy as np
In [34]: import numexpr as ne
In [35]: np.random.seed(0)
In [36]: X = np.random.randn(3072, 10000)

# computing ReLU derivative
In [42]: np.ceil(np.clip(X, 0, 1))

While benchmarking this against Divakar's other solutions, I found that the above approach is excruciatingly slow (north of 30x slower). Below are the timings, from fastest to slowest:

In [43]: %timeit -n100 ne.evaluate('X>=0').view('i1')  
10.6 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [44]: %timeit -n100 (X>=0).view('i1')
13.6 ms ± 77.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [45]: %timeit -n100 ne.evaluate('(X>=0)+0') 
22.1 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# the slowest one by far
In [46]: %timeit -n100 np.ceil(np.clip(X, 0, 1)) 
317 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

What factor(s) cause this slowness? Where does the bottleneck lie?

kmario23
  • That ceil/clip thing is just way more complex than anything else you're testing it against, especially with what appears to be numexpr optimizing some of the options. – user2357112 Mar 06 '19 at 05:49
  • (Also branch prediction. The branch predictor has a bad time with that `clip`.) – user2357112 Mar 06 '19 at 05:59

1 Answer


First, you're just performing a way more complex sequence of operations. For each input, your ceil/clip thing does the following:

  • Is the input value less than 0? If so, set the intermediate value to 0.
  • Otherwise, is it greater than 1? If so, set the intermediate value to 1.
  • Otherwise, set the intermediate value to the input value.
  • Compute the ceiling of the intermediate value and set the output to that.

(This happens in two phases, one where all the clipping is done, one where all the ceil-ing is done.)
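The per-element work described above can be sketched in plain Python (a scalar model for illustration only; NumPy's actual loops are compiled C operating on whole arrays):

```python
import math

def relu_deriv_clip_ceil(x):
    """Scalar model of np.ceil(np.clip(x, 0, 1)) for a single element."""
    # Phase 1: clip -- two data-dependent branches per element
    if x < 0:
        tmp = 0.0
    elif x > 1:
        tmp = 1.0
    else:
        tmp = x
    # Phase 2: ceil -- a separate rounding pass over the intermediate
    return math.ceil(tmp)
```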

You're timing this against options that do the following for each input:

  • Perform a >= comparison between the input and 0 and set the output to that.

It's no surprise that the >= is faster.
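On random data the two formulations agree element-wise (they differ only at exactly 0.0, a value `np.random.randn` effectively never produces), which you can confirm before benchmarking:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(3072, 10000)

via_cmp = (X >= 0).view('i1')                      # one comparison per element
via_clip = np.ceil(np.clip(X, 0, 1)).astype('i1')  # same values, far more work

# Same 0/1 result, wildly different cost
assert np.array_equal(via_cmp, via_clip)
```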


Second, your ceil/clip thing is writing 16 times as many bytes as the >=. The >= produces a single byte of output per input element (the view is a view, so no data copy there), while your ceil/clip thing produces an intermediate array and an output array, both of dtype float64.
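To make the 16x concrete, you can count the bytes each path writes, using the same array shape as the question:

```python
import numpy as np

X = np.random.randn(3072, 10000)

# Comparison path: the bool result is 1 byte per element,
# and .view('i1') reinterprets those bytes without copying.
out_cmp = (X >= 0)
bytes_cmp = out_cmp.nbytes

# clip/ceil path: a float64 intermediate plus a float64 output,
# 8 bytes per element each.
tmp = np.clip(X, 0, 1)
out_clip = np.ceil(tmp)
bytes_clip = tmp.nbytes + out_clip.nbytes

print(bytes_clip // bytes_cmp)  # -> 16
```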


Third, the branch predictor has a bad time with that clip on a random array. It has no idea what branch will be taken each time. A more predictable array goes through clip much faster:

In [21]: %timeit X.clip(0, 1)
1 loop, best of 5: 211 ms per loop

In [22]: A = np.full_like(X, 0.5)

In [23]: %timeit A.clip(0, 1)
10 loops, best of 5: 86.6 ms per loop

Finally, at least on the machine and NumPy build I tested on, numpy.ceil is just surprisingly slow:

In [24]: %timeit np.ceil(X)
10 loops, best of 5: 166 ms per loop

I'm not sure whether it's hitting a software ceil implementation or what. This is probably going to be different on different builds.

user2357112