2

I have this code to make a 100k*200k matrix occupy less space: it packs 8 elements of the binary matrix AU into each byte of A to reduce memory consumption. As you can expect, this code takes forever to run, and I am planning on increasing the number of rows to 200k as well. I am running the code on a pretty powerful instance (CPU and GPU) and can scale it, so can anyone help parallelize this code so that it is quicker?

import numpy as np
import time
colm = int(2000000/8)
rows = 1000000
cols = int(colm*8)
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
start_time = time.time()

A = np.empty((rows,colm), dtype=np.uint8)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        A[i,j] = 0
        # pack 8 consecutive binary values into one byte, most significant bit first
        for k in range(8):
            if AU[i,(j*8)+k] == 1:
                A[i,j] = A[i,j] | (1<<(7-k))
asked by secretive (edited by Jérôme Richard)
  • If you simply use numba it should already be quite a bit faster than your code, as well as a vectorised numpy routine. Furthermore you should be able to compile it for a GPU target as well... I will put together a version tomorrow and post it... I still have to see what the innermost section does precisely... Maybe one could improve the cache locality by rewriting the loop somehow... – 2b-t Apr 22 '21 at 22:26
  • I assume the random portion is just test data, or you'd just generate the data in that format to begin with, e.g. `A = np.random.randint(256,size=(rows, colm),dtype=np.uint8)` – Mark Tolonen Apr 22 '21 at 22:36
  • The data in AU is my real data that I will read from a file, but since it will be so huge, I am compressing it down like this – secretive Apr 22 '21 at 22:40
  • @2b-t thanks, appreciate it – secretive Apr 22 '21 at 22:40

3 Answers

6

Warning: You are trying to allocate a huge amount of memory: with rows=1000000 and cols=2000000, the int8 matrix AU alone is 2×10^12 bytes, i.e. about 2 TB of memory that you probably do not have.

Assuming you have enough memory, or you can reduce the size of the dataset, you can write a much, much faster implementation using the Numba JIT. Moreover, you can parallelize the code and replace the slow conditional with a branchless implementation to significantly speed up the computation, since AU is filled with binary values. Finally, you can unroll the inner loop over k to make the code even faster. Here is the resulting implementation:

import numpy as np
import numba as nb
colm = int(2000000/8)
rows = 1000000
cols = int(colm*8)
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
A = np.empty((rows,colm), dtype=np.uint8)

@nb.njit('void(uint8[:,:],int8[:,:])', parallel=True)
def compute(A, AU):
    # parallel loop over the rows; each output byte packs 8 inputs, MSB first
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            offset = j * 8
            res = AU[i,offset] << 7
            res |= AU[i,offset+1] << 6
            res |= AU[i,offset+2] << 5
            res |= AU[i,offset+3] << 4
            res |= AU[i,offset+4] << 3
            res |= AU[i,offset+5] << 2
            res |= AU[i,offset+6] << 1
            res |= AU[i,offset+7]
            A[i,j] = res

compute(A, AU)

On my machine, this code is 37851 times faster than the original implementation on a smaller dataset (with colm=int(20000/8) and rows=10000): the original implementation took 6min 3s while the optimized one took 9.6 ms.
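For reference, a minimal timing harness along these lines (an assumption on my part, not the exact script used) reproduces the comparison at the reduced sizes:

import time
import numpy as np

colm, rows = int(20000/8), 10000
cols = colm * 8
AU = np.random.randint(2, size=(rows, cols), dtype=np.int8)
A = np.empty((rows, colm), dtype=np.uint8)

compute(A, AU)              # warm-up call (with the explicit signature the JIT compiles eagerly anyway)
t0 = time.time()
compute(A, AU)
print(time.time() - t0)     # ~9.6 ms reported above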

This code is memory-bound on my machine. With the current inputs, it is close to optimal, as it spends most of its time reading the AU input matrix. A good additional optimization would be to "compress" the AU matrix into a smaller one (if possible).
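A hedged sketch of one way to act on that remark: stream AU through memory in row blocks, so the full 2 TB matrix never has to exist at once. Here `load_block` is a hypothetical reader standing in for however the real data comes off disk, and `block` is an assumed tuning knob:

import numpy as np

block = 4096  # rows per block (assumed value; tune to available memory)
# packed result: 8x smaller than AU, but still ~250 GB at the full problem size
A = np.empty((rows, colm), dtype=np.uint8)
for r0 in range(0, rows, block):
    r1 = min(r0 + block, rows)
    AU_block = load_block(r0, r1)  # hypothetical: returns an int8 array of shape (r1-r0, cols)
    compute(A[r0:r1], AU_block)    # reuse the compiled kernel on each slice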

Jérôme Richard
  • Correct me if I am wrong but `20,000 / 8 * 10,000 = 25,000,000`. We are talking about 25MB here. Hardly a feat. Not to criticize your work, but if that's all Python can do, it's time to use C or maybe Julia. – Tarik Apr 23 '21 at 00:59
  • 1
    @Tarik You forgot to multiply the size by 8 (see `cols = int(colm*8)`). The arrays are of size: `rows * cols = 10000 * int(int(20000/8)*8) = 10000 * 20000 = 200 MB` and `rows * colm = 10000 * int(20000/8) = 25 MB`. Reading/Writing 225 MB in 9.6 ms means that the memory throughput is 23.4 GB/s which is quite high. Not to mention that I have an Intel processor. Due to the write allocate behaviour of Intel processor caches, memory writes actually cause a read from the memory. Thus, the true memory throughput is 26 GB/s. This is close to the practical maximum throughput I can reach (~38 GB/s). – Jérôme Richard Apr 23 '21 at 11:39
  • @Tarik As for the memory size, I just wanted to make a small test. I could try a 2.25 GB test but the original version would take 1h to compute... The optimized version takes about 0.13 seconds on this 2.25 GB dataset. It seems very good to me (for a common desktop machine). – Jérôme Richard Apr 23 '21 at 11:47
  • 2
    You could also declare C-contiguous arrays `'void(uint8[:,::1],int8[:,::1])'` or leave it without a signature. The difference isn't that high (just about 10% in the single-threaded version), but explicitly declaring non-contiguous arrays is almost never a good idea. This often prevents SIMD vectorization and can be much slower. – max9111 Apr 23 '21 at 11:49
  • 1
    @JérômeRichard The code you provided is excellent. I could not do better with Python. I am looking into moving to Julia but so far I got mixed results. Julia scales better but it's sometimes challenging to match optimized Python code. – Tarik Apr 23 '21 at 12:22
  • If I multithread the Julia code (from the sibling question) over the outer loop, I get runtime of <9ms on my laptop. I haven't checked the Python code, as I haven't yet installed python. I think both the numba and julia codes are very good, but LoopVectorization.jl might be able to improve on it still. – DNF Apr 23 '21 at 14:03
1

Python solution:

import numpy as np
import time

def compute(A, AU):
    A[:,:] = 0
    # Put every 8 columns of AU into A (least significant bit first)
    for i in range(AU.shape[1]):
        # cast to uint8 so the shift/or does not promote to a wider type
        A[:, i//8] = np.bitwise_or(A[:, i//8], np.left_shift(AU[:, i].astype(np.uint8), i % 8))

colm = int(20000/8)
rows = 10000
cols = int(colm*8)
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
A = np.empty((rows,colm), dtype=np.uint8)

start_time = time.time()

compute(A, AU)

end_time = time.time()
print(end_time - start_time)

Packs bits in 1/2 second
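A minimal sanity check, assuming NumPy 1.17+ for the `bitorder` keyword: the LSB-first packing above should agree with numpy.packbits using `bitorder='little'`:

import numpy as np

AU_small = np.random.randint(2, size=(4, 16), dtype=np.int8)
A_small = np.empty((4, 2), dtype=np.uint8)
compute(A_small, AU_small)
# bitorder='little' maps the first element of each group of 8 to the least significant bit
assert np.array_equal(A_small, np.packbits(AU_small, axis=-1, bitorder='little'))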

Same code in C:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char* argv[]) {

    int colm = 200000/8;
    int rows = 10000;
    int cols = colm*8;
    unsigned char *A = (unsigned char *)malloc(rows * colm * sizeof(unsigned char)); 
    unsigned char *AU = (unsigned char *)malloc(rows * cols * sizeof(unsigned char)); 
    int i, j;
    clock_t begin;
    clock_t end;
    double time_spent;

    begin = clock();
        
    // Create AU
    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            *(AU + i*cols + j) = (unsigned char) (rand() & 0x01);  
            
    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%lf seconds to create AU\n", time_spent);
            
    begin = clock();
    
    // Create a zeroed out A
    for (i = 0; i < rows; i++)
        for (j = 0; j < colm; j++)
            *(A + i*colm + j) = (unsigned char) 0;  

    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%lf seconds to create A\n", time_spent);

    begin = clock();
            
    // Pack into bits
    for (i = 0; i < rows; i++)
        for (j = 0; j < colm; j++) {
            int au_idx = i*cols + j*8;
            for (int k = 0; k < 8; k++)
                *(A + i*colm + j) |= *(AU + au_idx + k) << k;
        }
            
    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%lf seconds to pack\n", time_spent);
            

    free(A); 
    free(AU);
    return 0;
}

Tested with cols=200,000 (colm=25,000) and rows=10,000. Bit packing takes 0.27 seconds against 0.64 seconds for the optimized Python version provided by Jérôme Richard. Calls to rand() are expensive and greatly increase the overall runtime. In terms of memory, the C version peaks at 2 GB against Python's 4.2 GB. Further code optimization and parallelization would certainly reduce the runtime.

Julia version:

using Random
colm = 200000÷8
rows = 30000
cols = colm*8

AU = zeros(UInt8, (rows, cols))

rand!(AU)
AU .&= 0x01

A = zeros(UInt8, (rows, colm))

function compute(A, AU)
    # pack 8 columns of AU into each byte of A (least significant bit first)
    for i in 1:size(A)[2]
        start_col = (i-1) << 3
        @views A[:, i] .=  AU[:, start_col + 1] .| 
                   (AU[:, start_col + 2] .<< 1) .|
                   (AU[:, start_col + 3] .<< 2) .|
                   (AU[:, start_col + 4] .<< 3) .|
                   (AU[:, start_col + 5] .<< 4) .|
                   (AU[:, start_col + 6] .<< 5) .|
                   (AU[:, start_col + 7] .<< 6) .|
                   (AU[:, start_col + 8] .<< 7)        
    end
end

@time compute(A, AU)

Julia scales well in terms of performance. Results with colm=25,000 and rows=30,000:

Language  Total Run Time (secs)  Bit Packing Time (secs)  Peak Memory (GB)
Python    22.1                   3.0                      6
Julia     11.7                   1.2                      6
Tarik
  • It's producing an error in the second for loop: `TypeError: ufunc 'bitwise_or' output (typecode 'h') could not be coerced to provided output parameter (typecode 'B') according to the casting rule ''same_kind''` – secretive Apr 22 '21 at 22:26
  • You probably have to initialize the numpy array with zero bytes by specifying the data type. – Tarik Apr 23 '21 at 00:39
  • What compiler did you use? I get using `clang Code.c -O3 -march=native -o Code.exe` quite the same results on Numba and Clang (both around 490 ms,both single threaded) and around 100 ms for the Numba parallel version. (using `'void(uint8[:,::1],int8[:,::1])'` or no typing signature) – max9111 Apr 23 '21 at 11:45
  • @max9111 I am using gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0. I got a core dump with 30,000 rows. Maybe my code is broken (I have not written C code since the '90s). – Tarik Apr 23 '21 at 11:56
  • @Tarik your C and Julia code give *wrong results*: the correct shift is `(7-k)`, not `k`. I think the Python code has the same issue. Moreover, note that the type of the input `AU` matrix is not exactly what the OP needs: theirs is `int8`, not `uint8`. This may have an impact on the performance. – Jérôme Richard Apr 23 '21 at 12:18
  • @JérômeRichard Yeah, I ignored the bit order. The idea is to pack 8 bits gleaned from 8 int array elements that contain either one or zero, into a single byte. – Tarik Apr 23 '21 at 12:27
  • Besides this, my Python code is roughly 2 times faster on my 6-core machine than the (sequential) C code (using clang with `-O3`). A parallel C code should likely outperform all other implementations, but is more tedious to write and maintain, and I am not sure it will be faster by a large margin ;) . Note that you should be careful about how you measure the timings: if you initialize the `A` matrix beforehand, you will not pay the cost of the page faults in the computational loop. This is an important point to consider if you want to compare the different implementations precisely. – Jérôme Richard Apr 23 '21 at 12:35
  • @JérômeRichard To make a fair comparison with Numba, you also have to specify march=native (as Numba does) for the C code. Only -O3 leads to 70% longer runtimes using e.g. clang. – max9111 Apr 23 '21 at 13:03
  • @JérômeRichard Yes, C code is a pain. That's why I am looking into Julia, which provides performance along with an elegant syntax. You are right about my perf measurement being approximate. That said, completely eliminating side effects is not easy. When performance differs by ±10%, who cares. But when it's 2x or more, that's a whole different story. – Tarik Apr 23 '21 at 13:07
  • @max9111 Good catch. I forgot it. The C code is a bit faster with `-march=native` but still significantly slower ;) . – Jérôme Richard Apr 23 '21 at 13:09
0

After reading your post yesterday I had actually planned to write a routine myself with the just-in-time compiler numba, but Jérôme was faster than me and delivered an excellent solution. But I have an alternative to offer: why reinvent the wheel when there already exists a numpy function that does exactly the same thing: numpy.packbits.

import numpy as np

# AU is the unpacked matrix from the question; pack along the last axis
A = np.packbits(AU, axis=-1)

does the job. From my tests it seems to be quite a bit slower than Jérôme's version, but in any case way faster than your initial version.
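For completeness, a small usage sketch (the sizes here are stand-ins, and `bitorder` requires NumPy 1.17+): the default `bitorder='big'` puts the first element into the most significant bit, matching the `(7-k)` shift in the question:

import numpy as np

AU = np.random.randint(2, size=(4, 16), dtype=np.int8)  # small stand-in for the real data
A = np.packbits(AU, axis=-1)  # uint8 result of shape (4, 2), MSB first like the question
assert A.shape == (AU.shape[0], AU.shape[1] // 8)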

2b-t