This is a follow-up to another question: Faster implementation of ReLU derivative.
In an effort to come up with the fastest way of computing the derivative, I wrote several solutions, one of which is:
In [35]: np.random.seed(0)
In [36]: X = np.random.randn(3072,10000)
# computing ReLU derivative
In [42]: np.ceil(np.clip(X, 0, 1))
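For completeness, here is a small sanity check I would use to convince myself that the clip/ceil form really produces the same 0/1 mask as a plain comparison (a sketch on a small illustrative array, not part of the timed session above; the two forms can only disagree at values that are exactly zero, which essentially never occur with randn):

import numpy as np

np.random.seed(0)
X_small = np.random.randn(4, 5)

# clip into [0, 1], then round up: negatives -> 0.0, positives -> 1.0
mask_clip = np.ceil(np.clip(X_small, 0, 1))

# plain comparison; matches the clip/ceil form everywhere except at exactly 0
mask_cmp = (X_small > 0).astype(X_small.dtype)

assert np.array_equal(mask_clip, mask_cmp)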
While benchmarking this against Divakar's other solutions, I found that the above approach is excruciatingly slow (north of 30x slower). Below are the timings, from fastest to slowest:
In [43]: %timeit -n100 ne.evaluate('X>=0').view('i1')
10.6 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [44]: %timeit -n100 (X>=0).view('i1')
13.6 ms ± 77.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [45]: %timeit -n100 ne.evaluate('(X>=0)+0')
22.1 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# the slowest one by far
In [46]: %timeit -n100 np.ceil(np.clip(X, 0, 1))
317 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
What factor(s) cause this slowness? Where does the bottleneck lie?
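For what it's worth, here is a rough sketch (not part of the session above; output omitted) of how the two stages of the slow expression could be timed separately, next to the fast comparison, to see which stage dominates:

import timeit
import numpy as np

np.random.seed(0)
X = np.random.randn(3072, 10000)

# time each stage of the slow expression, plus the fast comparison, 10 runs each
t_clip = timeit.timeit(lambda: np.clip(X, 0, 1), number=10)
t_full = timeit.timeit(lambda: np.ceil(np.clip(X, 0, 1)), number=10)
t_cmp  = timeit.timeit(lambda: (X >= 0).view('i1'), number=10)

print(f"np.clip(X, 0, 1)          : {t_clip / 10 * 1e3:.1f} ms per run")
print(f"np.ceil(np.clip(X, 0, 1)) : {t_full / 10 * 1e3:.1f} ms per run")
print(f"(X >= 0).view('i1')       : {t_cmp / 10 * 1e3:.1f} ms per run")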