
I would like to efficiently apply a complicated function to the rows of a matrix in Python (EDIT: Python 3). In R this is done with the apply function and its cousins, and it runs quickly.

In Python I understand this can be done in a few ways: a list comprehension, numpy.apply_along_axis, or pandas.DataFrame.apply.

In my code these Python approaches are very slow. Is there another approach I should use? Or perhaps my implementation of these Python approaches is incorrect?

Here's an example. The math is taken from a probit regression model. To be clear, my goal is not to fit a probit regression; I am interested in an efficient approach to applying a function over rows.

In R:

> n = 100000
> p = 7
> x = matrix(rnorm(n * p, 0, 2), ncol = p)
> beta = rep(1, p)

> start <- Sys.time()
> test <- apply(x, 1, function(t)(dnorm(sum(t*beta))*sum(t*beta)/pnorm(sum(t*beta))) )
> end <- Sys.time()
> print(end - start)
Time difference of 0.6112201 secs

In Python via comprehension:

import numpy as np
from scipy.stats import norm
import time

n = 100000
p = 7
x = norm.rvs(0, 2, n * p)
x = x.reshape((n, p))
beta = np.ones(p)

start = time.time()
test = [norm.pdf(sum(x[i,] * beta)) * sum(x[i,] * beta) / norm.cdf(sum(x[i,] * beta))
        for i in range(n)]
end = time.time()
print (end - start)
23.316735982894897

In Python via pandas.DataFrame.apply:

import pandas as pd

frame = pd.DataFrame(x)
f = lambda t: norm.pdf(sum(t)) * sum(t) / norm.cdf(sum(t))
start = time.time()
test = frame.apply(f, axis = 1)
end = time.time()
print(end - start)
34.39404106140137

In this question, the most upvoted answer points out that numpy.apply_along_axis is not built for speed, so I am not including that approach.

Again, I am interested in performing these calculations quickly. I truly appreciate your help!

XRK
  • python 2 or python 3? In python 2, avoid `range(100000)`, as it has to allocate the list, use `xrange` instead. In your special case, avoid both in favor of `[norm.pdf(sum(x_*beta))*sum(x_*beta)/norm.cdf(sum(x_*beta)) for x_ in x.transpose()]`. What you really want is possibly to construct a ufunc with numpy. – Marcus Müller Oct 22 '17 at 21:23
  • Why are you evaluating pdf, cdf at a single point? – percusse Oct 22 '17 at 21:23
  • I’d like to suggest [cupy](https://github.com/cupy/cupy)? Requires CUDA libraries and a CUDA-enabled GPU... – agtoever Oct 22 '17 at 21:28
  • @percusse this expression is part of the gradient of the likelihood of a probit regression model. I know libraries can execute probit regression for me, but this expression illustrates the type of calculations I'd like to do. – XRK Oct 22 '17 at 21:32
  • Also see [my Python performance answer with quick benchmarks](https://stackoverflow.com/a/44995079/3056710) on a related question. – agtoever Oct 22 '17 at 21:32
  • @MarcusMüller Python 3, edited original question. I'll look at your suggestions in detail. Thank you! – XRK Oct 22 '17 at 21:33
  • Then you can start summing the array once and referring to the sum entry along that axis – percusse Oct 22 '17 at 21:34

2 Answers


List comprehensions use Python-level loops, which are very inefficient. You should modify your code to take advantage of numpy vectorisation. If you replace what you have between the time.time() calls with

xbeta = np.sum(x * beta, axis=1)
test = norm.pdf(xbeta) * xbeta / norm.cdf(xbeta)

you'll see a massive difference. On my machine it completes in 0.02 seconds. I've tested it against your list comprehension for peace of mind and they both give the same result.

xbeta is the quantity your original code wastefully computes three times for every row. By summing x * beta along the second axis, we collapse it into a 1D array with one entry for each of your 100000 rows. From there all the calculations deal with 1D arrays, so we let numpy take care of the rest.
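For reference, here is a minimal sketch of the full timed snippet with the vectorised replacement dropped in, assuming the same n, p, x and beta as in the question:

import time
import numpy as np
from scipy.stats import norm

n = 100000
p = 7
x = norm.rvs(0, 2, n * p).reshape((n, p))
beta = np.ones(p)

start = time.time()
xbeta = np.sum(x * beta, axis=1)                  # one (n,) array of row sums, computed once
test = norm.pdf(xbeta) * xbeta / norm.cdf(xbeta)  # all n values in a single vectorised pass
end = time.time()
print(end - start)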

Reti43
  • Actually the *apply* family in R are loops under the hood and arguably not vectorized: [Is the “*apply” family really not vectorized?](https://stackoverflow.com/q/28983292/1422451). – Parfait Oct 22 '17 at 22:15
  • @Parfait Your comment is appreciated since I don't know R. I just likened its behaviour to numpy. So, is my understanding correct to say that `apply` uses R-level loops, but they just happen to be more efficient than what pure Python does? – Reti43 Oct 22 '17 at 22:28
  • See above link, especially the [OP's own answer](https://stackoverflow.com/a/29006276/1422451): the *apply* functions are simply hidden loops that run at the C level but call R-level functions (R, like Python, is written on top of C and a bit of Fortran). *Vectorized* is loosely defined as machine-level-only computing. – Parfait Oct 22 '17 at 22:40
  • @Parfait Yes, I meant specifically about `apply`, not the *apply family. – Reti43 Oct 22 '17 at 22:59

This seems to be what you want:

y = np.dot(x, beta)
test2 = norm.pdf(y) * y / norm.cdf(y)

# let's compare it to the expected output:
np.allclose(test2, np.array(test))
True

This computes all 100000 values in your "test" list (to within numerical tolerance), but runs in about 11.5 ms, according to ipython:

%time y = np.dot(x, beta); test2 = norm.pdf(y) * y / norm.cdf(y)
CPU times: user 10 ms, sys: 0 ns, total: 10 ms
Wall time: 15.2 ms

This improves on your version by:

  1. eliminating the common sum(x[i,]*beta) expression
  2. vectorizing the loop from i=0 to 100000 into a dot product.

Of the two, the second is vastly more important.

I also note that your code above uses sum, which is the Python built-in that sums over any iterable. You almost certainly should be using np.sum, which is the vectorized numpy version specialized for numpy arrays. I mention this only as an aside, because once I rewrote your code in "batch" form the sum became implicit in the dot product, so np.sum doesn't appear in the final version.
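As a quick sanity check, here is a short self-contained sketch (regenerating x and beta locally with the same shapes as in the question) showing that the built-in sum, np.sum, and the dot product all agree:

import numpy as np

x = np.random.normal(0, 2, size=(100000, 7))
beta = np.ones(7)

# Built-in sum loops over the row in Python; np.sum stays in compiled code.
row = x[0]
assert np.isclose(sum(row * beta), np.sum(row * beta))

# The per-row sums collapse into a single matrix-vector product.
y = x @ beta                                   # equivalent to np.dot(x, beta)
assert np.allclose(y, np.sum(x * beta, axis=1))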

olooney
  • Hmmm...comparing yours and OP's numpy and pandas, `np.all()` returns *False*. – Parfait Oct 22 '17 at 22:41
  • @Parfait, OP did the same problem two different ways, only experimenting with pandas to see if it was faster (it was not.) There's no reason to use pandas at all, so we should ignore that part. I'm comparing `test2` computed by my code to the first `test` list computed by the python list comprehension several lines above. – olooney Oct 22 '17 at 22:58
  • Correct. I compared your *test2* and OP's *test* and `np.all()` does not return *True*. – Parfait Oct 22 '17 at 23:52
  • Yeah? It's exactly equal on my machine, but I guess it could be sensitive to floating point rounding. What does `np.allclose(test2, np.array(test))` say on your machine? Or failing that how big is `np.mean( np.abs(test2-np.array(test)) )`? – olooney Oct 23 '17 at 00:02
  • Now I get *True* with `np.allclose()` with very small difference: `3.60353489474e-14`. I'm on ubuntu 64-bit / python 3.5 / numpy 1.12.1 – Parfait Oct 23 '17 at 00:06
  • @Parfait makes sense. Floating point operations accumulate errors in their least significant bits on different architectures, which is why it's best practice to use `np.isclose()` and `np.allclose()` to compare floating point numbers instead of equality. I'll update my answer. – olooney Oct 23 '17 at 00:15