
I am currently working on a machine learning project where, given a data matrix Z and a vector rho, I have to compute the value and slope of the logistic loss function at rho. The computation involves basic matrix-vector multiplication and log/exp operations, with a trick to avoid numerical overflow (described in this previous post).

I am currently doing this in Python using NumPy as shown below (as a reference, this code runs in 0.2s). Although this works well, I would like to speed it up since I call the function multiple times in my code (and it represents over 90% of the computation involved in my project).

I am looking for any way to improve the runtime of this code without parallelization (i.e. only 1 CPU). I am happy using any publicly available package in Python, or calling C or C++ (since I have heard that this can improve runtimes by an order of magnitude). Preprocessing the data matrix Z would also be OK. Some properties that could be exploited are that the vector rho is usually sparse (around 50% of its entries are 0) and that there are usually far more rows than columns (in most cases n_cols <= 100); a small sketch of the sparsity idea follows the code below.


import time
import numpy as np

np.__config__.show() #make sure BLAS/LAPACK is being used
np.random.seed(seed = 0)

#initialize data matrix X and label vector Y
n_rows, n_cols = int(1e6), 100
X = np.random.random(size=(n_rows, n_cols))
Y = np.random.randint(low=0, high=2, size=(n_rows, 1))
Y[Y==0] = -1
Z = X*Y # all operations are carried out on Z

def compute_logistic_loss_value_and_slope(rho, Z):
    #compute the value and slope of the logistic loss function in a way that is numerically stable
    #loss_value: (1 x 1) scalar = 1/n_rows * sum(log(1 + exp(-Z*rho)))
    #loss_slope: (n_cols x 1) vector = 1/n_rows * Z.T * (1/(1 + exp(-Z*rho)) - 1)
    #see also: https://stackoverflow.com/questions/20085768/

    scores = Z.dot(rho)
    pos_idx = scores > 0
    exp_scores_pos = np.exp(-scores[pos_idx])
    exp_scores_neg = np.exp(scores[~pos_idx])

    #compute loss value
    loss_value = np.empty_like(scores)
    loss_value[pos_idx] = np.log(1.0 + exp_scores_pos)
    loss_value[~pos_idx] = -scores[~pos_idx] + np.log(1.0 + exp_scores_neg)
    loss_value = loss_value.mean()

    #compute loss slope
    phi_slope = np.empty_like(scores)
    phi_slope[pos_idx]  = 1.0 / (1.0 + exp_scores_pos)
    phi_slope[~pos_idx] = exp_scores_neg / (1.0 + exp_scores_neg)
    loss_slope = Z.T.dot(phi_slope - 1.0) / Z.shape[0]

    return loss_value, loss_slope


#initialize a vector of integers where more than half of the entries = 0
rho_test = np.random.randint(low=-10, high=10, size=(n_cols, 1))
set_to_zero = np.random.choice(n_cols, size=n_cols // 2, replace=False)
rho_test[set_to_zero] = 0.0

start_time = time.time()
loss_value, loss_slope = compute_logistic_loss_value_and_slope(rho_test, Z)
print "total runtime = %1.5f seconds" % (time.time() - start_time)
  • Why are you excluding more than 1 CPU? Although the Python VM is essentially single-threaded, you could invoke POSIX threads from within a C extension after you copy the data to a more thread-friendly data structure. There may be other reasons not to use multiple CPUs, but you aren't limited by that restriction if you escape to C. – rts1 Feb 02 '16 at 17:23
  • @rts Good question. In this case, I need to limit it to 1 CPU since the code that calls the `compute_logistic_loss_function` is actually parallelized... So only 1 CPU will be available when the function is called. – Berk U. Feb 02 '16 at 17:35
  • For large `n` the runtime seems to be dominated by `loss_slope = Z * (phi_slope - 1.0)`, which broadcasts out to the same size as `Z`. Since you're taking the mean over rows, you could re-write this as a dot product using `Z.T.dot(phi_slope).T / Z.shape[0]`, which gives about a factor of 4 speed-up on my machine. – ali_m Feb 02 '16 at 18:04
  • The most expensive operations are matrix products, so you should make sure your version of numpy is linked against a fast BLAS library (e.g. OpenBLAS or MKL). If I were you, I would use multithreaded BLAS to parallelize the matrix products and ditch your parallelization over the calling Python code. – ali_m Feb 02 '16 at 18:07
  • Sorry, I meant `Z.T.dot(phi_slope - 1.0).T / Z.shape[0]` in my comment above – ali_m Feb 02 '16 at 18:11
  • @ali_m Thank you for pointing this out! I was actually doing the computation in the way you suggested in my actual build, so I integrated it into the sample code (I had written it differently in the original post because it was easier to explain what was happening, though I did not realize that it would make such a huge difference). I also added a line at the end to make sure that BLAS was linked. Multithreading is really not an option in this case, unfortunately. – Berk U. Feb 02 '16 at 20:03
  • Your description of `X` is confusing since it doesn't match your example code. Am I right in thinking that by "data matrix" you meant `Z` rather than `X`? Also, how sparse is `rho`? There might be huge savings to be had from doing sparse rather than dense matrix-vector multiplication. It would be helpful if you could make your example arrays match the dtype, sparsity and approximate range of values in your real data. – ali_m Feb 02 '16 at 20:54
  • @ali_m Thanks for pointing that out. I changed the comments so that everything matches up. To answer the questions: half of the entries of `rho` are typically 0, as is the case in the code, which is probably something to exploit. However, `Z` is not typically sparse (in some cases it will be composed only of -1, 0, 1; since I didn't put that in the sample code, I will leave it as a follow-up post). – Berk U. Feb 03 '16 at 00:18

2 Answers


Libraries of the BLAS family are already highly tuned for best performance, so no amount of effort spent linking to your own C/C++ code is likely to give you any benefit. You could, however, try various BLAS implementations, since there are quite a few of them around, including some specially tuned for certain CPUs.

The other thing that comes to mind is to use a library like theano (or Google's tensorflow) that is able to represent the entire computational graph (all of the operations in your function above) and apply global optimizations to it. It can then generate CPU code from that graph via C++ (and, by flipping a simple switch, GPU code as well). It can also automatically compute symbolic derivatives for you. I've used theano for machine learning problems and it's a really great library for that, although not the easiest one to learn.

(I'm posting this as an answer because it's too long for a comment)

Edit:

I actually had a go at this in theano, but the result is about 2x slower on the CPU; see the comment in the code for why. I'll post it here anyway, since it may be a starting point for someone else to do something better. (This is only partial code; complete it with the code from the original post.)

import theano

def make_graph(rho, Z):
    scores = theano.tensor.dot(Z, rho)

    # this is very inefficient... it calculates everything twice and
    # then picks one of them depending on scores being positive or not.
    # not sure how to express this in theano in a more efficient way
    pos = theano.tensor.log(1 + theano.tensor.exp(-scores))
    neg = -scores + theano.tensor.log(1 + theano.tensor.exp(scores))
    loss_value = theano.tensor.switch(scores > 0, pos, neg)
    loss_value = loss_value.mean()

    # however computing the derivative is a real joy now:
    loss_slope = theano.tensor.grad(loss_value, rho)

    return loss_value, loss_slope

sym_rho = theano.tensor.col('rho')
sym_Z = theano.tensor.matrix('Z')
sym_loss_value, sym_loss_slope = make_graph(sym_rho, sym_Z)

compute_logistic_loss_value_and_slope = theano.function(
        inputs=[sym_rho, sym_Z],
        outputs=[sym_loss_value, sym_loss_slope]
        )

# use function compute_logistic_loss_value_and_slope() as in original code
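
For reference, a minimal usage sketch of the compiled function (this assumes Z and rho_test are defined as in the original post, and casts them to theano's float type, since the symbolic inputs are floatX):

Z_f = Z.astype(theano.config.floatX)
rho_f = rho_test.astype(theano.config.floatX)
loss_value, loss_slope = compute_logistic_loss_value_and_slope(rho_f, Z_f)
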
– jlh

NumPy is already quite optimized. The best you can do is to try other libraries with data of the same size, initialized at random (not initialized to 0), and run your own benchmark, as sketched below.

If you want to experiment, you can of course try different BLAS implementations. You should also give Eigen a try; I personally found it faster in one of my applications.

– Hedi
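
As a concrete starting point for such a benchmark, you can time the two matrix products that dominate the runtime in isolation, so that different NumPy/BLAS builds (or wrappers around other libraries) can be compared on identical random data. A minimal sketch; the shapes mirror the question, and you may want to scale n_rows down first:

import timeit
import numpy as np

n_rows, n_cols = int(1e6), 100
Z = np.random.random(size=(n_rows, n_cols))
rho = np.random.random(size=(n_cols, 1))
v = np.random.random(size=(n_rows, 1))

#time the forward product Z*rho and the transposed product Z.T*v, which
#together dominate compute_logistic_loss_value_and_slope
t_fwd = min(timeit.repeat(lambda: Z.dot(rho), number=10, repeat=3)) / 10
t_bwd = min(timeit.repeat(lambda: Z.T.dot(v), number=10, repeat=3)) / 10
print("Z.dot(rho): %.4f s per call, Z.T.dot(v): %.4f s per call" % (t_fwd, t_bwd))
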