
I'm trying to optimise numpy.packbits:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def _numba_pack(arr, div, su):
    for i in prange(div):
        s = 0
        for j in range(i*8, i*8+8):
            s = 2*s + arr[j]
        su[i] = s
        
def numba_packbits(arr):
    div, mod = np.divmod(arr.size, 8)
    su = np.zeros(div + (mod>0), dtype=np.uint8)
    _numba_pack(arr[:div*8], div, su)
    if mod > 0:
        su[-1] = sum(x*y for x,y in zip(arr[div*8:], (128, 64, 32, 16, 8, 4, 2, 1)))
    return su

>>> X = np.random.randint(2, size=99, dtype=bool)
>>> print(numba_packbits(X))
[ 75  24  79  61 209 189 203 187  47 226 170  61   0]

It appears to be 2 - 2.5 times slower than np.packbits(X). How is this implemented internally in numpy? Could this be improved in numba?
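
A quick sanity check that the result matches NumPy (a minimal sketch, not part of the benchmark; it relies on the default big-endian bit order of np.packbits):

# Verify the Numba version against NumPy's reference implementation
assert np.array_equal(numba_packbits(X), np.packbits(X))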

I use numpy == 1.21.2 and numba == 0.53.1, installed via conda install. My platform is:

[screenshot of platform details]

Results:

import benchit
from numpy import packbits
%matplotlib inline
benchit.setparams(rep=5)

sizes = [100000, 300000, 1000000, 3000000, 10000000, 30000000]
N = sizes[-1]
arr = np.random.randint(2, size=N, dtype=bool)
fns = [numba_packbits, packbits]

in_ = {s/1000000: (arr[:s], ) for s in sizes}
t = benchit.timings(fns, in_, multivar=True, input_name='Millions of bits')
t.plot(logx=True, figsize=(12, 6), fontsize=14)

[benchmark plot: np.packbits is consistently faster than numba_packbits across the tested sizes]
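
If benchit is not available, a plain timeit comparison along the same lines can be used (a rough sketch, assuming the functions and arrays defined above):

import timeit

# Rough per-size comparison without benchit (5 repetitions each)
for s in sizes:
    a = arr[:s]
    t_nb = timeit.timeit(lambda: numba_packbits(a), number=5) / 5
    t_np = timeit.timeit(lambda: packbits(a), number=5) / 5
    print(f"{s/1e6:>5} Mbits: numba {t_nb*1e3:.2f} ms, numpy {t_np*1e3:.2f} ms")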

Update

Using the code from Jérôme's answer:

@njit('void(bool_[::1], uint8[::1], int_)', inline='never')
def _numba_pack_x64_byJérôme(arr, su, pos):
    for i in range(64):
        j = i * 8
        su[i] = (arr[j]<<7)|(arr[j+1]<<6)|(arr[j+2]<<5)|(arr[j+3]<<4)|(arr[j+4]<<3)|(arr[j+5]<<2)|(arr[j+6]<<1)|arr[j+7]
       
@njit(parallel=True)
def _numba_pack_byJérôme(arr, div, su):
    for i in prange(div//64):
        _numba_pack_x64_byJérôme(arr[i*512:(i+1)*512], su[i*64:(i+1)*64], i)
    for i in range(div//64*64, div):
        j = i * 8
        su[i] = (arr[j]<<7)|(arr[j+1]<<6)|(arr[j+2]<<5)|(arr[j+3]<<4)|(arr[j+4]<<3)|(arr[j+5]<<2)|(arr[j+6]<<1)|arr[j+7]
        
def numba_packbits_byJérôme(arr):
    div, mod = np.divmod(arr.size, 8)
    su = np.zeros(div + (mod>0), dtype=np.uint8)
    _numba_pack_byJérôme(arr[:div*8], div, su)
    if mod > 0:
        su[-1] = sum(x*y for x,y in zip(arr[div*8:], (128, 64, 32, 16, 8, 4, 2, 1)))
    return su

Usage:

>>> print(numba_packbits_byJérôme(X))
[ 75  24  79  61 209 189 203 187  47 226 170  61   0]

Results:

[updated benchmark plot]

  • Does your code become faster when `X.size` is divisible by 8, so that `mod > 0` is false? For example, `X = np.random.randint(2, size=80, dtype=bool)` – ForceBru Jan 14 '22 at 15:35
  • @ForceBru I'm focussing on big data so there is no effect – mathfux Jan 14 '22 at 15:42
  • The [numpy implementation](https://github.com/numpy/numpy/blob/5cc7ef066fca7a821a2160b095578384c301ae3c/numpy/core/src/multiarray/compiled_base.c#L1618) is written in C and uses SIMD if available. What makes you think it can be accelerated with numba? – myrtlecat Jan 14 '22 at 16:17
  • @myrtlecat One of the reasons is the ability to do it in parallel. Another reason: there are indeed some algorithms I managed to optimise with `numba` vs `numpy`, for example `numpy.ravel_multi_index`, on my 8GB, 4-core Windows machine. However, I've found there's no easy way to optimise `np.sum` or `np.argsort`. I would be glad if someone helped me to sort out which algorithms of `numpy` could be optimised and why, at least basic ones. – mathfux Jan 14 '22 at 16:39
  • Check the `source` of the `numpy` function (or method). If it is `built-in`, that is already compiled. Your own `numba` code probably isn't going to be much of an improvement - unless `numba` already implements it. You'll get the biggest improvements when converting code/functions that use python-level iterations. – hpaulj Jan 14 '22 at 17:11
  • You call `numba_movebits`, but shouldn't it be `numba_packbits`? As for Numpy, `np.sum` is currently sub-optimal. I worked on its basic vectorization and plan to make a PR soon but it will still be a bit sub-optimal unless you specify the `dtype` (see: [this post](https://stackoverflow.com/questions/70134026) for more information). For sorts, Intel recently vectorized a part of the code so it works much faster on processors supporting AVX-512. `argsort` is probably still not vectorized though as this is hard to do. AFAIR, the Numba implementation is currently a bit slower for sorts. – Jérôme Richard Jan 14 '22 at 17:21
  • Thanks @hpaulj and @Jérôme for shedding some light on guidelines and further info. I also bumped into `sort` and `argsort` while trying to optimise some groupby algorithms, so I guess it's not going to work more efficiently in forthcoming releases of `numpy` – mathfux Jan 14 '22 at 17:39
  • On a Xeon Skylake-SP machine with 10 cores (and 20 threads), Numpy 1.20.3 and Numba 0.54.1, it turns out that your Numba implementation is 14 times *faster* than `np.packbits` on a 1G-sized array. It is also 2 times *faster* in sequential. So, the question is why it is slower on some platforms (including yours). It turns out that the Numpy implementation is not so well optimized (at least the default package)... – Jérôme Richard Jan 14 '22 at 17:54
  • @JérômeRichard Thanks for pointing it out. I think it depends strongly on number of cores/threads. Not sure if `np.packbits` (or how) can do it multicore. I use `numpy == 1.21.2` and `numba == 0.53.1` – mathfux Jan 14 '22 at 18:19
  • I just tested on both Linux and Windows on another machine with a 6-core Intel Skylake. The Windows machine behaves the same way as previously (*even in sequential*). It has Numpy 1.20.3 and Numba 0.54.1. On Linux, Numpy is surprisingly much faster and Numba slower. I also use the same versions on the Linux platform. So, the difference certainly comes from the PIP package for the target OS. My guess is that `np.packbits` is not vectorized in the PIP package for Windows. The PIP package for Linux uses an SSE vectorization (which is not optimal but quite good). Can you add information about your platform? – Jérôme Richard Jan 14 '22 at 18:42
  • Thank you. On Windows, the Numba version scales well, but not at all on Linux: it is only 1.5 times faster with 6 cores... This is very surprising since this is an embarrassingly parallel code and none of the implementations appear to be memory bound on my Linux machine. I guess there is definitely something wrong with the Numba version (probably not related to your code). I have a possible explanation, but if this is what I think it is, this is going to be quite complex/deep. I will check my hypothesis. – Jérôme Richard Jan 14 '22 at 19:37

1 Answer


There are several issues with the Numba implementation. One of them is that parallel loops break the constant propagation optimization in LLVM-Lite (the JIT compiler used by Numba). This causes critical information like array strides not to be propagated, resulting in a slow scalar implementation instead of a SIMD one, plus additional unneeded instructions to compute the offsets. Such issues can also be seen in C code; Numpy added specific macros to help compilers automatically vectorize the code (i.e. use SIMD instructions) when the stride of the working dimension is actually 1.
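
To illustrate the stride point (a minimal sketch, not the answer's code; `pack8` is a hypothetical name): a `[:]` signature leaves the layout unknown at compile time, whereas `[::1]` asserts C-contiguity, so the stride is known to be 1 and the compiler has a chance to vectorize the loop. The two specializations can then be compared with `inspect_asm()`:

from numba import njit

# Same kernel compiled for an arbitrary-stride array and for a C-contiguous one.
@njit(['void(bool_[:], uint8[:])',       # layout/stride unknown at compile time
       'void(bool_[::1], uint8[::1])'])  # ::1 = C-contiguous, stride is 1
def pack8(arr, out):
    for i in range(out.size):
        j = i * 8
        s = 0
        for k in range(8):
            s = 2 * s + arr[j + k]
        out[i] = s

# pack8.inspect_asm() returns one assembly listing per compiled signature.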

A solution to overcome the constant propagation issue is to call another Numba function. This function must not be inlined. The signature should be provided manually so the compiler knows at compile time that the stride of the array is 1 and can generate faster code. Finally, the function should work on fixed-size chunks, because function calls are expensive and the compiler can then vectorize the chunked code. Unrolling the loop with shifts also produces faster code (although it is uglier). Here is an example:

from numba import njit, prange

@njit('void(bool_[::1], uint8[::1], int_)', inline='never')
def _numba_pack_x64(arr, su, pos):
    for i in range(64):
        j = i * 8
        su[i] = (arr[j]<<7)|(arr[j+1]<<6)|(arr[j+2]<<5)|(arr[j+3]<<4)|(arr[j+4]<<3)|(arr[j+5]<<2)|(arr[j+6]<<1)|arr[j+7]

@njit('void(bool_[::1], int_, uint8[::1])', parallel=True)
def _numba_pack(arr, div, su):
    for i in prange(div//64):
        _numba_pack_x64(arr[i*512:(i+1)*512], su[i*64:(i+1)*64], i)  # each chunk packs 512 bits into 64 bytes
    for i in range(div//64*64, div):
        j = i * 8
        su[i] = (arr[j]<<7)|(arr[j+1]<<6)|(arr[j+2]<<5)|(arr[j+3]<<4)|(arr[j+4]<<3)|(arr[j+5]<<2)|(arr[j+6]<<1)|arr[j+7]
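
For completeness, the wrapper from the question can be reused unchanged to drive these kernels (a sketch; it handles the trailing bits exactly like the question's version does):

import numpy as np

def numba_packbits(arr):
    div, mod = np.divmod(arr.size, 8)
    su = np.zeros(div + (mod > 0), dtype=np.uint8)
    _numba_pack(arr[:div*8], div, su)
    if mod > 0:
        # Pack the last (incomplete) group of bits, MSB first, like np.packbits
        su[-1] = sum(x*y for x, y in zip(arr[div*8:], (128, 64, 32, 16, 8, 4, 2, 1)))
    return su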

Benchmark

Here are performance results on my 6-core machine (i5-9600KF) with a billion random items as input:

Initial Numba (seq):    189 ms  (x0.7)
Initial Numba (par):    141 ms  (x1.0)
Numpy (seq):             98 ms  (x1.4)
Optimized Numba (par):   35 ms  (x4.0)
Theoretical optimal:     27 ms  (x5.2)  [fully memory-bound case]

This new implementation is 4 times faster than the initial parallel implementation and about 3 times faster than Numpy.
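
The "theoretical optimal" line is a simple memory-bandwidth bound: the kernel has to read one byte per input bit and write one byte per 8 input bits. A back-of-the-envelope check (the sustained bandwidth value below is an assumption, not stated in the answer):

read_bytes  = 1_000_000_000        # one bool byte per input bit
write_bytes = 1_000_000_000 // 8   # one packed byte per 8 input bits
bandwidth   = 42e9                 # assumed ~42 GB/s sustained DRAM bandwidth
print((read_bytes + write_bytes) / bandwidth)  # ~0.027 s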


Delving into the generated assembly code

When parallel=False is set and prange is replaced with range, the following assembly code is generated on my Intel processor supporting AVX-2:

.LBB0_7:
    vmovdqu 112(%rdx,%rax,8), %xmm1
    vmovdqa 384(%rsp), %xmm3
    vpshufb %xmm3, %xmm1, %xmm0
    vmovdqu 96(%rdx,%rax,8), %xmm2
    vpshufb %xmm3, %xmm2, %xmm3
    vpunpcklwd  %xmm0, %xmm3, %xmm3
    vmovdqu 80(%rdx,%rax,8), %xmm15
    vmovdqa 368(%rsp), %xmm5
    vpshufb %xmm5, %xmm15, %xmm4
    vmovdqu 64(%rdx,%rax,8), %xmm0
    [...] <------------------------------  ~180 other instructions discarded
    vpcmpeqb    %xmm3, %xmm11, %xmm2
    vpandn  %xmm8, %xmm2, %xmm2
    vpor    %xmm2, %xmm1, %xmm1
    vpcmpeqb    %xmm3, %xmm0, %xmm0
    vpaddb  %xmm0, %xmm1, %xmm0
    vpsubb  %xmm4, %xmm0, %xmm0
    vmovdqu %xmm0, (%r11,%rax)
    addq    $16, %rax
    cmpq    %rax, %rsi
    jne .LBB0_7

The code is not very good because it uses many unneeded instructions (like SIMD comparison instructions, probably due to implicit casts from boolean types), a lot of registers are temporarily spilled to memory (register spilling), and it uses 128-bit vectors instead of the 256-bit AVX ones supported on my machine. That being said, the code is vectorized and each loop iteration writes 16 bytes at once without any conditional branches (except the one of the loop), so the resulting performance is not so bad.
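
For reference, dumps like the one above can be obtained directly from the compiled function; `inspect_asm()` returns one listing per compiled signature (a sketch):

# Print the generated assembly of every compiled specialization
for sig, asm in _numba_pack.inspect_asm().items():
    print(sig)
    print(asm)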

In fact, the Numpy code is much smaller and more efficient. This is why it is about 2 times faster than the sequential Numba code on my machine with big inputs. Here is the hot assembly loop:

4e8:
    mov      (%rdx,%rax,8),%rcx
    bswap    %rcx
    mov      %rcx,0x20(%rsp)
    mov      0x8(%rdx,%rax,8),%rcx
    add      $0x2,%rax
    movq     0x20(%rsp),%xmm0
    bswap    %rcx
    mov      %rcx,0x20(%rsp)
    movhps   0x20(%rsp),%xmm0
    pcmpeqb  %xmm1,%xmm0
    pcmpeqb  %xmm1,%xmm0
    pmovmskb %xmm0,%ecx
    mov      %cl,(%rsi)
    movzbl   %ch,%ecx
    mov      %cl,(%rsi,%r13,1)
    add      %r9,%rsi
    cmp      %rax,%r8
    jg       4e8

It reads values in chunks of 8 bytes and computes them partially using 128-bit SSE instructions. 2 bytes are written per iteration. That being said, it is not optimal either because 256-bit SIMD instructions are not used, and I think the code can be optimized further.

When the initial parallel code is used, here is the assembly code of the hot loop:

.LBB3_4:
     movq %r9, %rax
     leaq (%r10,%r14), %r9
     movq %r15, %rsi
     sarq $63, %rsi
     andq %rdx, %rsi
     addq %r11, %rsi
     cmpb $0, (%r14,%rsi)
     setne     %cl
     addb %cl, %cl
     [...] <---------------  56 instructions (with few 256-bit AVX ones)
     orb  %bl, %cl
     orb  %al, %cl
     orb  %dl, %cl
     movq %rbp, %rdx
     movb %cl, (%r8,%r15)
     incq %r15
     decq %rdi
     addq $8, %r14
     cmpq $1, %rdi
     jg   .LBB3_4

The above code is mostly not vectorized and is quite inefficient. It uses a lot of instructions (including quite slow ones like setne/cmovlq/cmpb to perform many conditional stores) per iteration just to write 1 byte at a time. Numpy executes about 8 times fewer instructions for the same amount of written bytes. The inefficiency of this code is mitigated by the use of multiple threads. In the end, the parallel version can be a bit faster on machines with many cores (eg. >= 6).

The improved implementation provided at the beginning of this answer generates code similar to the sequential implementation shown above, but uses multiple threads (so it is still far from optimal, but much better).

  • Wonderful. I updated my results. You've succeeded a lot more. Is there anything wrong with the way I used your code? – mathfux Jan 15 '22 at 06:41
  • I think the signature of `_numba_pack` should be specified (with the `::1`). If you do not know whether the stride is 1 or not, then you can write two functions, or use 2 signatures. I expect the code not to compile anyway if the stride is not 1, currently, in your code. If it is not that, then it probably means that it comes from the vectorization behaving differently on your machine (since your processor is quite different from mine). If so, it would be good to have the generated assembly (using `numba_function.inspect_asm()`) and the sequential time so as to check the efficiency of the code. – Jérôme Richard Jan 15 '22 at 13:19
  • Thanks for a lot of hw, this is useful. I'll start checking if sources of `numpy` methods are built-in, inspect SIMD instructions and use `numba_function.inspect_asm()` more often. Unfortunately, I work on Jupyter Notebook (not sure if it supports inspecting source code) and don't know a lot about computer architecture. A lot of things to learn in the future are waiting. Thanks & tips for you :) – mathfux Jan 18 '22 at 20:25