
This is a question based on (or a follow-up to) another question: Faster implementation of ReLU derivative.

In the spirit of coming up with the fastest way of computing the derivative, I wrote several solutions, one of which is:

In [33]: import numpy as np
In [34]: import numexpr as ne
In [35]: np.random.seed(0)
In [36]: X = np.random.randn(3072, 10000)

# computing ReLU derivative
In [42]: np.ceil(np.clip(X, 0, 1))

While benchmarking this against Divakar's other solutions, I found that the above approach is excruciatingly slow (north of 30x slower). Below are the timings, from fastest to slowest:

In [43]: %timeit -n100 ne.evaluate('X>=0').view('i1')  
10.6 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [44]: %timeit -n100 (X>=0).view('i1')
13.6 ms ± 77.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [45]: %timeit -n100 ne.evaluate('(X>=0)+0') 
22.1 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# the slowest one by far
In [46]: %timeit -n100 np.ceil(np.clip(X, 0, 1)) 
317 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

What factor(s) cause this slowness? Where does the bottleneck lie?

kmario23
  • That ceil/clip thing is just way more complex than anything else you're testing it against, especially with what appears to be numexpr optimizing some of the options. – user2357112 Mar 06 '19 at 05:49
  • (Also branch prediction. The branch predictor has a bad time with that `clip`.) – user2357112 Mar 06 '19 at 05:59

1 Answer


First, you're just performing a way more complex sequence of operations. For each input, your ceil/clip thing does the following:

  • Is the input value less than 0? If so, set the intermediate value to 0.
  • Otherwise, is it greater than 1? If so, set the intermediate value to 1.
  • Otherwise, set the intermediate value to the input value.
  • Compute the ceiling of the intermediate value and set the output to that.

(This happens in two phases, one where all the clipping is done, one where all the ceil-ing is done.)
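The per-element work described above can be sketched in plain Python (a scalar model for illustration only; NumPy's actual loops are compiled C operating on whole arrays):

```python
import math

def relu_deriv_clip_ceil(x):
    """Scalar model of np.ceil(np.clip(x, 0, 1)) for a single element."""
    # Phase 1: clip -- two data-dependent branches per element
    if x < 0:
        tmp = 0.0
    elif x > 1:
        tmp = 1.0
    else:
        tmp = x
    # Phase 2: ceil -- a separate rounding pass over the intermediate
    return math.ceil(tmp)
```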

You're timing this against options that do the following for each input:

  • Perform a >= comparison between the input and 0 and set the output to that.

It's no surprise that the >= is faster.
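On random data the two formulations agree element-wise (they differ only at exactly 0.0, a value `np.random.randn` effectively never produces), which you can confirm before benchmarking:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(3072, 10000)

via_cmp = (X >= 0).view('i1')                      # one comparison per element
via_clip = np.ceil(np.clip(X, 0, 1)).astype('i1')  # same values, far more work

# Same 0/1 result, wildly different cost
assert np.array_equal(via_cmp, via_clip)
```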


Second, your ceil/clip thing is writing 16 times as many bytes as the >=. The >= produces a single byte of output per input element (the view is a view, so no data copy there), while your ceil/clip thing produces an intermediate array and an output array, both of dtype float64.
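To make the 16x concrete, you can count the bytes each path writes, using the same array shape as the question:

```python
import numpy as np

X = np.random.randn(3072, 10000)

# Comparison path: the bool result is 1 byte per element,
# and .view('i1') reinterprets those bytes without copying.
out_cmp = (X >= 0)
bytes_cmp = out_cmp.nbytes

# clip/ceil path: a float64 intermediate plus a float64 output,
# 8 bytes per element each.
tmp = np.clip(X, 0, 1)
out_clip = np.ceil(tmp)
bytes_clip = tmp.nbytes + out_clip.nbytes

print(bytes_clip // bytes_cmp)  # -> 16
```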


Third, the branch predictor has a bad time with that clip on a random array. It has no idea what branch will be taken each time. A more predictable array goes through clip much faster:

In [21]: %timeit X.clip(0, 1)
1 loop, best of 5: 211 ms per loop

In [22]: A = np.full_like(X, 0.5)

In [23]: %timeit A.clip(0, 1)
10 loops, best of 5: 86.6 ms per loop

Finally, at least on the machine and NumPy build I tested on, numpy.ceil is just surprisingly slow:

In [24]: %timeit np.ceil(X)
10 loops, best of 5: 166 ms per loop

I'm not sure whether it's hitting a software ceil implementation or what. This is probably going to be different on different builds.

user2357112