
This question (Parallelize code which is doing bit wise operation) got a very good answer with extremely efficient Numba code that I could only match with C. That prompted me to attempt to at least match the Python code in Julia.

The following Python code packs bits into unsigned integers in 0.64 seconds, versus 0.27 seconds for the C code.

import numpy as np
import numba as nb
import time
colm = int(200000/8)  # packed (byte) columns
rows = 10000
cols = int(colm*8)    # unpacked (bit) columns
AU = np.random.randint(2, size=(rows, cols), dtype=np.int8)  # random 0/1 input
A = np.empty((rows, colm), dtype=np.uint8)                   # packed output

@nb.njit('void(uint8[:,:],int8[:,:])', parallel=True)
def compute(A, AU):
    # Pack each group of 8 bits into one byte, most significant bit first.
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            offset = j * 8
            res = AU[i,offset] << 7
            res |= AU[i,offset+1] << 6
            res |= AU[i,offset+2] << 5
            res |= AU[i,offset+3] << 4
            res |= AU[i,offset+4] << 3
            res |= AU[i,offset+5] << 2
            res |= AU[i,offset+6] << 1
            res |= AU[i,offset+7]
            A[i,j] = res

start_time = time.time()

compute(A, AU)

end_time = time.time()
print(end_time - start_time)

The following are my various failed attempts to match that performance in Julia:

using Random
colm = 200000÷8
rows = 10000
cols = colm*8

AU = zeros(UInt8, (rows, cols))

rand!(AU)
AU .&= 0x01  # keep only the least significant bit: random 0/1 input

A = BitArray(undef, rows, cols)   # bit-packed target for compute1
B = zeros(UInt8, (rows, colm))    # byte-packed target for compute2-4

# compute1: let the BitArray do the packing via a broadcast comparison
function compute1(A, AU)
    A[:,:] .= AU .== 1
end

# compute2: build each packed column from eight shifted column slices
function compute2(A, AU)
    for i in 1:size(A)[2]
        start_col = (i-1) << 3
        A[:, i] .=  AU[:, start_col + 1]        .| 
                   (AU[:, start_col + 2] .<< 1) .|
                   (AU[:, start_col + 3] .<< 2) .|
                   (AU[:, start_col + 4] .<< 3) .|
                   (AU[:, start_col + 5] .<< 4) .|
                   (AU[:, start_col + 6] .<< 5) .|
                   (AU[:, start_col + 7] .<< 6) .|
                   (AU[:, start_col + 8] .<< 7)        
    end
end

# compute3: like compute2, but OR-accumulating into the output column
function compute3(A, AU)
    for i in 1:size(A)[2]
        start_col = (i-1) << 3
        A[:, i] .|=  AU[:, start_col + 1] 
        A[:, i] .|=  AU[:, start_col + 2] .<< 1
        A[:, i] .|=  AU[:, start_col + 3] .<< 2
        A[:, i] .|=  AU[:, start_col + 4] .<< 3
        A[:, i] .|=  AU[:, start_col + 5] .<< 4
        A[:, i] .|=  AU[:, start_col + 6] .<< 5
        A[:, i] .|=  AU[:, start_col + 7] .<< 6
        A[:, i] .|=  AU[:, start_col + 8] .<< 7
    end
end

# compute4: one pass per bit position across the whole matrix
function compute4(A, AU)
    for i in 0:7
        au_columns = [((j-1) << 3) + i + 1 for j in 1:size(A)[2]] 
        A[:, :] .|=  AU[:, au_columns] .<< i
    end
end

@time compute1(A, AU)
@time compute2(B, AU)
@time compute3(B, AU)
@time compute4(B, AU)

Output:

  6.128301 seconds (553.01 k allocations: 30.192 MiB, 2.22% compilation time)
  3.640022 seconds (1.97 M allocations: 1.984 GiB, 3.05% gc time, 12.43% compilation time)
  2.956211 seconds (1.44 M allocations: 3.842 GiB, 3.73% gc time, 19.24% compilation time)
  6.720456 seconds (946.91 k allocations: 3.776 GiB, 2.40% gc time, 4.68% compilation time)

The different methods take between 3 and 6 seconds. I'm not sure how to improve the performance to at least match Python/Numba.

Tarik
  • Quick note: slicing in Julia creates a copy, so every time you do `A[:, ...]` you're allocating a whole new array. Use views: https://docs.julialang.org/en/v1/manual/performance-tips/#man-performance-views (see the sketch after these comments) – giordano Apr 23 '21 at 10:55
  • The numba code is highly optimized and parallel. I am surprised you were able to match the performance with simple C code. You will need to do something similar to `nb.prange` in Julia to come close. – MegaIng Apr 23 '21 at 10:57
  • Yep, views help. `compute4` without views: `3.545348 seconds (400.00 k allocations: 3.791 GiB, 16.12% gc time)`; with views: `1.850615 seconds (200.00 k allocations: 1.895 GiB, 14.13% gc time)`. Notice the decrease in the number of allocations. – ForceBru Apr 23 '21 at 10:58
  • For me `compute2` without `@views` is `2.898168 seconds (1.98 M allocations: 1.985 GiB, 5.86% gc time, 10.84% compilation time)`, with `@views` is `0.363232 seconds (667.67 k allocations: 32.487 MiB, 4.75% gc time, 50.73% compilation time)`. Also, better use https://github.com/JuliaCI/BenchmarkTools.jl for benchmarking Julia code, and interpolate global variables: `julia> @btime compute2($B, $AU)` gives `172.111 ms (0 allocations: 0 bytes)`. – giordano Apr 23 '21 at 11:01
  • @MegaIng It's now matched and even improved by just using the `@views` macro in the `compute2` function. Using `A = Bool.(A2)` gives similar performance. See @VincentYu's answer. – Tarik Apr 23 '21 at 13:55
  • @ForceBru Using `@views` macro nailed it. – Tarik Apr 23 '21 at 14:07
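Following the comments above about views, here is a minimal sketch of that fix: the same `compute2` body with the `@views` macro, so the column slices alias `AU` instead of copying it (the name `compute2_views!` is my own label for illustration):

function compute2_views!(A, AU)
    for i in 1:size(A, 2)
        start_col = (i - 1) << 3
        # @views turns each slice below into a view, removing the
        # per-column copies that dominate the original compute2.
        @views A[:, i] .=  AU[:, start_col + 1]        .|
                          (AU[:, start_col + 2] .<< 1) .|
                          (AU[:, start_col + 3] .<< 2) .|
                          (AU[:, start_col + 4] .<< 3) .|
                          (AU[:, start_col + 5] .<< 4) .|
                          (AU[:, start_col + 6] .<< 5) .|
                          (AU[:, start_col + 7] .<< 6) .|
                          (AU[:, start_col + 8] .<< 7)
    end
end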

2 Answers


Why aren't you trying to make the Julia code look like the Python code if you want to compare their performance? So something like:

rowm = 200000 ÷ 8   # packed rows; note the layout is transposed vs. the Python version
cols = 10000
rows = rowm * 8

AU = rand(Int8.(0:1), rows, cols)
A = zeros(UInt8, rowm, cols)

function compute!(A, AU)
    for i in 1:size(A, 2)
        Threads.@threads for j in 1:size(A, 1)
            offset = (j-1) * 8 + 1
            res =  AU[offset  , i] << 7
            res |= AU[offset+1, i] << 6
            res |= AU[offset+2, i] << 5
            res |= AU[offset+3, i] << 4
            res |= AU[offset+4, i] << 3
            res |= AU[offset+5, i] << 2
            res |= AU[offset+6, i] << 1
            res |= AU[offset+7, i]
            A[j, i] = res % UInt8  # reinterpret as UInt8, keeping the bit pattern
        end
    end
end

Note that Julia is column-major, so the order of indices should be swapped. And you need to explicitly start Julia with multiple threads for the multi-threading to be useful (`julia -t8` on Julia 1.6).
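To confirm the extra threads actually took effect, one can check from inside the session, for example:

# Reports the number of threads this Julia session was started with;
# it should print 8 after launching with `julia -t8`.
println(Threads.nthreads())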

Kristoffer Carlsson
  • I am less into imitating than figuring out how to beat the best Python can come up with. – Tarik Apr 23 '21 at 12:01
  • I tried your code using `colm = 200000/8` and `rows = 30000`. Total runtime 48.0 seconds vs 11.7 seconds using the `compute2()` function modified with the `@views` macro. Although I specified `julia --threads 4`, multi-threading does not seem to kick in for some reason. I only see one CPU going 100%. – Tarik Apr 23 '21 at 12:19
  • Move `Threads.@threads` to the _outer_ loop (over `i`), and add `@inbounds`. That gave a 20x speedup on my computer, running in 100ms for 200000x10000. For a 200000x30000 matrix, it ran in 330ms (see the sketch after these comments). – DNF Apr 23 '21 at 13:07
  • BTW, creating `AU` is quite a strain on my patience now, but sample from a tuple instead of a range, and it's a lot faster. So `rand(UInt8.((0, 1)), rows, cols)`. – DNF Apr 23 '21 at 13:22
  • @DNF I already moved the `@threads` to the outer loop and it had zero effect. Where should I put the `@inbounds` macro? – Tarik Apr 23 '21 at 14:10
  • That is extremely strange. With the `@threads` macro moved, it should match Numba. Put `Threads.@threads` in front of the first loop, and `@inbounds` in front of the second, for example. – DNF Apr 23 '21 at 14:16
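Putting DNF's two suggestions together gives, as a sketch (same transposed layout as the answer's code; the function name is my own):

# Threads.@threads on the outer (column) loop, @inbounds on the inner
# loop, per DNF's comments above.
function compute_threaded!(A, AU)
    Threads.@threads for i in 1:size(A, 2)
        @inbounds for j in 1:size(A, 1)
            offset = (j - 1) * 8 + 1
            res  = AU[offset,   i] << 7
            res |= AU[offset+1, i] << 6
            res |= AU[offset+2, i] << 5
            res |= AU[offset+3, i] << 4
            res |= AU[offset+4, i] << 3
            res |= AU[offset+5, i] << 2
            res |= AU[offset+6, i] << 1
            res |= AU[offset+7, i]
            A[j, i] = res % UInt8
        end
    end
end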

For single-threaded conversion into a BitArray, `AU .% Bool` (originally `Bool.(AU)`; see the edit note below) should be efficient:

using Random
using BenchmarkTools
AU = rand(Bool, 10_000, 200_000)
@benchmark Bool.($AU)
BenchmarkTools.Trial: 
  memory estimate:  238.42 MiB
  allocs estimate:  4
  --------------
  minimum time:     658.897 ms (0.00% GC)
  median time:      672.948 ms (0.00% GC)
  mean time:        676.287 ms (0.86% GC)
  maximum time:     710.870 ms (6.57% GC)
  --------------
  samples:          8
  evals/sample:     1

Edit: I just realized `Bool.(AU)` won't work well for you because you are converting from an array of 8-bit integers, not an array of `Bool`s, so `Bool.(AU)` requires checking that every element of `AU` is 0 or 1. Instead, use `AU .% Bool`, which will just take the least bit of each integer and should have the performance shown above.
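Concretely, a minimal sketch of that conversion, assuming `AU` holds 0/1 integers as in the question:

# Take the least significant bit of each integer with .% Bool and
# write the result into a pre-allocated BitArray.
A = BitArray(undef, size(AU))
A .= AU .% Bool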

Vincent Yu
  • `Bool.(AU)` is slow indeed. `AU .% Bool` performs well but does not beat `compute2()` with the `@views` macro. – Tarik Apr 23 '21 at 13:01