3

I'm looking for an efficient way to compute the entropy of vectors, without normalizing them and while ignoring any non-positive value.

Since the vectors aren't probability vectors, and shouldn't be normalized, I can't use scipy's entropy function.

So far I couldn't find a single numpy or scipy function that does this, and as a result my alternatives break the computation into 2 steps, which requires intermediate arrays and slows down the run time. If anyone can think of a single function for this computation it would be interesting.

Below is a timeit script for measuring several alternatives that I tried. I'm using a pre-allocated array to avoid repeated allocations and deallocations during run time. The alternative to run is selected by setting the value of func_code. I included the nansum approach offered by one of the answers. The measurements on my MacBook Pro 2019 are:

matmul: 16.720187613
xlogy: 17.296380516
nansum: 20.059866123000003

import timeit

import numpy as np
from scipy import special


def matmul(arg):
    a, log_a = arg
    # Take log2 only where a > 0; other entries keep the pre-filled 0
    log_a.fill(0)
    np.log2(a, where=a > 0, out=log_a)
    # Row-wise dot product of a and log_a via a batched matmul
    return (a[:, None, :] @ log_a[..., None]).ravel()


def xlogy(arg):
    a, log_a = arg
    # Zero out negative entries so that xlogy(0, 0) contributes 0 to the sum
    a[a < 0] = 0
    # xlogy uses the natural log, so rescale the result to base 2
    return np.sum(special.xlogy(a, a), axis=1) * (1/np.log(2))


def nansum(arg):
    a, log_a = arg
    # Non-positive entries produce nan in a * log2(a), and nansum skips them
    return np.nansum(a * np.log2(a, out=log_a), axis=1)


def setup():
    a = np.random.rand(20, 1000) - 0.1
    log = np.empty_like(a)
    return a, log


setup_code = """
from __main__ import matmul, xlogy, nansum, setup
data = setup() 
"""
func_code = "matmul(data)"

print(timeit.timeit(func_code, setup=setup_code, number=100000))
Assaf
    Temporary allocations are an unfortunate necessity for almost anything in Numpy. You can try numexpr instead which was designed specifically to avoid this issue: https://pypi.org/project/numexpr/ – Homer512 Jul 19 '22 at 09:10
  • re temporary allocations - right, but sometimes it can be avoided. For example, by using a dot product instead of multiplying and summing in two separate functions. As I mentioned in the question, scipy's entropy is almost exactly what I need, and in principle it could work without temporary allocations, but it normalizes the vectors. re numexpr - I tried that in the past. It was nice in principle, but it was slower than 2-3 separate numpy calls. – Assaf Jul 19 '22 at 14:06
  • I experimented a little more with numexpr. I got better results when utilizing multiple threads, but worse results when limited to a single thread. I expected it to be faster, since on paper it performs fewer log and multiplication operations and it avoids temporary allocations. I opened an issue at numexpr's github: https://github.com/pydata/numexpr/issues/417 – Assaf Jul 20 '22 at 09:36

2 Answers

1

On my machine the computation of the logarithms takes about 80% of the time of matmul, so it is definitely the bottleneck, and optimizing the other operations will result in a negligible speed-up.

The bad news is that the default np.log implementation is not yet optimized on most platforms. Indeed, it is not vectorized by default, except on recent x86 Intel processors supporting AVX-512 (i.e. basically Skylake processors on servers and Ice Lake processors on PCs, though not the recent Alder Lake). This means the computation could be significantly faster once vectorized. AFAIK, the closed-source SVML library does support AVX/AVX2 and could speed it up (on x86-64 processors only). SVML is supported by Numexpr and Numba, which can be faster because of that, assuming you have access to the non-free SVML, which is part of the Intel tools often available on HPC machines (e.g. MKL, OneAPI, etc.).
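For reference, a single-expression Numexpr version of the row-wise computation might look like the following sketch (my own illustration, not part of the original answer; it assumes numexpr is installed and uses the natural log, since log2 is not among numexpr's supported functions):

import numpy as np
import numexpr as ne

def entropy_terms_ne(a):
    # where() keeps only the positive entries; log() is the natural log,
    # so rescale the row sums to base 2 afterwards
    s = ne.evaluate("sum(where(a > 0, a * log(a), 0), axis=1)")
    return s / np.log(2)

Whether this is actually faster than the NumPy variants depends on the thread count and on whether numexpr was built against Intel's VML, which matches the OP's mixed results in the comments above.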

If you do not have access to the SVML there are two possible remaining options:

  • Implement your own optimized SIMD log2 function, which is possible but hard since it requires a good understanding of the hardware SIMD units and certainly requires writing C or Cython code. This solution consists of computing the log2 function as an n-degree polynomial approximation (it can be exact to 1 ULP with a big n, though one generally does not need that). Naive approximations (eg. n=1) are much simpler to implement but often too inaccurate for scientific use (see the rough sketch after this list).
  • Implement a multi-threaded log computation, typically using Numba/Cython. This is a last-resort solution, as multithreading can slow things down if the input data is not large enough.
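To make the first option more concrete, here is a rough NumPy sketch of the polynomial-approximation idea (my own illustration, not SIMD code: a real implementation would use carefully derived minimax coefficients and SIMD intrinsics in C/Cython, whereas this one just fits a degree-5 polynomial at import time):

import numpy as np

# x = m * 2**e with m in [0.5, 1), so log2(x) = e + log2(m): only log2(m)
# needs a polynomial approximation over a small, fixed interval.
_m = np.linspace(0.5, 1.0, 1024)
_coeffs = np.polyfit(_m, np.log2(_m), 5)

def approx_log2(x):
    # Only valid for x > 0
    m, e = np.frexp(x)
    return e + np.polyval(_coeffs, m)

The split via np.frexp is the key trick; the accuracy of this quick fit is nowhere near 1 ULP.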

Here is an example of a multi-threaded Numba solution:

import numba as nb
import numpy as np

@nb.njit('(UniTuple(f8[:,::1],2),)', parallel=True)
def matmul(arg):
    a, log_a = arg
    result = np.empty(a.shape[0])
    # One row per parallel iteration; each row accumulates its own sum
    for i in nb.prange(a.shape[0]):
        s = 0.0
        for j in range(a.shape[1]):
            # Skip non-positive entries, like where=a > 0 in the NumPy version
            if a[i, j] > 0:
                s += a[i, j] * np.log2(a[i, j])
        result[i] = s
    return result
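
A hypothetical usage sketch, mirroring setup() from the question's timeit script (the first call also triggers JIT compilation, so it should be excluded from any timing):

a = np.random.rand(20, 1000) - 0.1
log_a = np.empty_like(a)        # unused by this version, but keeps the same tuple layout as setup()
row_sums = matmul((a, log_a))   # shape (20,): sum of a*log2(a) over the positive entries of each row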

This is about 4.3 times faster on my 6-core PC (200 µs vs 46.4 µs). However, you should be careful if you run this on a server with many cores on such a small dataset, as it can actually be slower on some platforms.

Jérôme Richard
  • Many thanks for your thorough answer and analysis. I have some follow-up questions: (1) My project is a python library to be used on PC, Mac and servers. Some of these have older Intel chipsets, some newer, and some have Arm (e.g. Apple M1). I don't have the expertise to implement a SIMD log2 myself, and I'm also limited to 1 thread. So I guess I should get access to SVML, but I'm not sure how it is used and deployed. Do I need it only at compilation time or also at run time? Do I need a different solution for Intel and Arm, or is there some solution that covers both, like numba? – Assaf Jul 23 '22 at 10:32
  • (2) About your code - you have an 'if' inside the inner for-loop. Does this make SIMD optimization harder? I'm trying to see what would be a good way to write the code to make it easier to use SIMD. I thought about a setup of 4 for-loops: (1) for the log2 computation over all of a's elements, (2) for multiplying the log2 results by a's elements (element-wise), (3) for replacing all nan values by 0, and (4) for summing the results. I'm a little concerned that this requires several intermediate arrays. I also don't like calling log2 on invalid values and then fixing the return value. – Assaf Jul 23 '22 at 10:42
  • (1) AFAIK the switch to SVML can be done at runtime (dynamic library), via Intel's libmath instead of the default one (which is easy to switch from one platform to another). If this does not work, it can be done at compile time, but that does not prevent a project from being used on other platforms (you just need not to use it on M1, for example, since it is not available there and there is AFAIK no proper alternative). I think tools like the Intel-based Anaconda/Python already do that for you automatically (though I never used them). I am not sure the Intel-accelerated Numpy uses the SVML, but Numba can. – Jérôme Richard Jul 23 '22 at 12:19
  • (2) The if indeed makes optimization harder, but I do not expect Numba (or GCC/Clang) to auto-vectorize the loop anyway. I only expect `np.log2` to do that, and only if the SVML is used. Theoretically, the glibc implemented a faster SIMD version a few years ago, but most tools still do not use it and I do not expect this to change in the short term. The goal of the provided Numba version is to use multiple threads, not SIMD instructions. Splitting the loop is generally a good idea, but the log2 takes all the time here, so the benefit will be small, as said in the answer. – Jérôme Richard Jul 23 '22 at 12:25
  • Creating an array in a hot loop is very expensive and not a good idea for performance. If the array is created once and then filled multiple times in a loop, this is much better. It will still be slow with a pure Numpy loop but fast in Numba. The thing is, the main problem is the log2 call/vectorization. Compilers do not vectorize that because they do not see that this is a log2. No hardware supports that except Intel Xeon Phi. Vectorization is possible only because people wrote manual SIMD implementations of the function that compilers can use instead of the usual call. – Jérôme Richard Jul 23 '22 at 12:31
0

Taking np.log2 of negative numbers just gives a runtime warning and sets those values to np.nan (np.log2(0) gives -inf, which also becomes nan once multiplied by the zero), which is probably the best way to deal with them. If you don't want these values to pollute your sum, just use

np.nansum(v_i*np.log2(v_i))
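
For the 2D case in the question (one value per row), the same one-liner would be applied with an axis argument; a minimal sketch, with errstate used to silence the expected warnings:

import numpy as np

a = np.random.rand(20, 1000) - 0.1   # same shape as in the question; some entries are negative
with np.errstate(divide="ignore", invalid="ignore"):
    result = np.nansum(a * np.log2(a), axis=1)   # entries from non-positive values become nan and are skipped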
  • Thanks! I wasn't aware of nansum. I think the downside of this approach is that it breaks the computation into 3 parts, but I need to measure it to tell for sure. I guess it's possible to use the 'out' argument to use one intermediate array and not two. – Assaf Jul 19 '22 at 10:53
  • I think this is a slightly better solution: let x be the 2D array of k vectors (shape = k×m): log2x = np.log2(x) ; log2x[np.isnan(log2x)] = 0 ; result = (x[:, None, :] @ log2x[..., None]).ravel() ; so there is an intermediate array of size k×m for the log computation, and two iterations over the full array, to find the nan instances and to set them to 0. I still wonder if we can find a solution without any intermediate array. – Assaf Jul 19 '22 at 14:59
  • @Assaf What about this: ```np.einsum("ij,ij->i", a, np.nan_to_num(np.log2(a), copy=False))```. The ```nan_to_num``` with ```copy=False``` avoids copying. It also avoids creating the temporary boolean array with ```isnan``` and then scanning that again for zero-setting. – Homer512 Jul 20 '22 at 09:55
  • @Assaf actually, I just looked at the source and ```nan_to_num``` calls ```isnan``` internally. Might still be the better choice because it also takes care of negative infinity, which may occur close to zero or exactly zero. – Homer512 Jul 20 '22 at 10:02
  • @Homer512, thanks, that's an interesting alternative. I wasn't aware of nan_to_num. I tested it and it is slower than the alternatives. Incidentally, my original code used einsum, and I switched to matmul following a reply here on SO - https://stackoverflow.com/questions/64231213/faster-alternative-to-numpy-einsum. I tried to replace einsum by matmul in your suggestion and it is still relatively slow. I think it is because it computes log for all entries in a, so it does useless work on entries where the value is non-positive. nan_to_num might be a little slow too. – Assaf Jul 20 '22 at 11:31
  • Overall, my findings are that it is better to use np.log() with a 'where' statement that restricts the function to positive entries. In this way it doesn't spend time on computation that is ignored later when replacing nan and -inf by 0. So setting the output array (log_a) to zeros, and running log(a, where=a>0, out=a_log) is faster than computing log over all a's entries and then fixing the bad entries of nan and -inf. – Assaf Jul 20 '22 at 11:35