I am trying to write my first multi-threaded program in C under Linux. I already have a program playing with setting and resetting bits in a big buffer, now I just want to make it faster - as fast as possible without writing everything in assembler.

For the single-threaded program, I defined my own macros for the bit manipulation (they kind of look big and ugly, but they work):

#define CELL_SIZE (64)

/* Bit 0 maps to the most significant bit of arr[0] (MSB-first within each 64-bit cell). */
#define getBit(bitIdx, arr)   ( arr[ (bitIdx) / CELL_SIZE ] &   ( (1uL << (CELL_SIZE - 1)) >> (bitIdx) % CELL_SIZE))
#define resetBit(bitIdx, arr) ( arr[ (bitIdx) / CELL_SIZE ] &= ~( (1uL << (CELL_SIZE - 1)) >> (bitIdx) % CELL_SIZE))

Obviously, those operations are anything BUT atomic.

I did some research and I found several answers. Some answers are about how to write my own functions in assembler (which I want to avoid).

Other sites (this or this) seem to talk exactly about what I need - except that I have no idea how to use those interfaces / macros.

I tried the following includes, but they do not work.

#include <linux/bitmap.h>
#include <asm/bitops.h>

So, basically, I am asking: how can I use what is already implemented in order to get my work done? Which header(s) should I include?

I do not mind if the interfaces are not in the kernel, as long as the job gets done.

Note: each thread would (re)set exactly one bit at one time - I do not plan any VOODOO complicated stuff.


A side question would be: which one would be faster (overall, written in pseudo-code):

reset_bit();

or

if ( bit_is_set() )
    reset_bit();

Quite often the reset_bit() operation is not necessary, because the bit is already reset. The same question applies to setting a bit that is already set.

virolino
  • That's a tricky question. Could you avoid the need for a synchronized bitset by having one bitset for each thread, then ANDing the results of each sieve thread together? – Nick ODell Feb 20 '22 at 01:01
  • I already thought about this, but the way I understand it is that I actually slow down all the threads in order to allow them to work with non-atomic operations, and then mix the results. In the end, I get single-threading-with-extra-steps-and-with-performance-penalty. IF I understood correctly your idea. – virolino Feb 20 '22 at 01:06
  • Another thing: I surely do not want to give each thread its own buffer, and then mix the buffers. I understand that higher speed has a cost, but wasting memory like that is not an option. – virolino Feb 20 '22 at 01:10
  • Non-atomic operations are faster than atomic operations, unless the atomic operation allows for a faster algorithm, or avoids other locking. – Nick ODell Feb 20 '22 at 01:10
  • The `linux/*` and `asm/*` headers are for writing code to run in the kernel itself. That's not what you're doing, is it? – Nate Eldredge Feb 20 '22 at 01:10
  • Nope, I do not touch the kernel in any way. I play nicely outside :) However, if I get support from the kernel, I do not mind at all. – virolino Feb 20 '22 at 01:12
  • The most straightforward way is to use an array of `_Atomic` integers, then `atomic_fetch_or` to set bits and `atomic_fetch_and` to clear them. These are standard C11 [atomics](https://en.cppreference.com/w/c/atomic). They are lock-free on x86-64 so they run directly on the CPU, the kernel has nothing to do with it. However, I suspect you will not be too happy with the resulting performance. – Nate Eldredge Feb 20 '22 at 01:12
  • I think most people learning about threads for the first time have an instinct to do something like this - I tried it with a Sieve of Eratosthenes. The problem is that atomic read-modify-write operations are many times slower to begin with, and cache line contention between threads slows things down whether atomic or not. The multithreaded approach is likely going to be slower, perhaps much slower, unless you carefully design your algorithm to minimize contention. – Nate Eldredge Feb 20 '22 at 01:16
  • As to your last question, hard to say. A load is generally faster than a read-modify-write. However, the cost of doing the test and conditional branch, and especially the possibility of mispredicting that branch, may nullify the savings. – Nate Eldredge Feb 20 '22 at 01:18
  • @NateEldredge: See also [How to set bits of a bit vector efficiently in parallel?](https://stackoverflow.com/q/45556086) - my answer there proposes an optimistic approach where a thread sets a bunch of bits in a small range, then reads back (after a memory barrier) to make sure all the bits actually got set. And BeeOnRope's answer is a more general look at avoiding the things that make this approach terrible. – Peter Cordes Feb 20 '22 at 03:39

1 Answer

To answer the question directly, the most straightforward and readily available solution is C11 <stdatomic.h>, which any modern C compiler for a multiprocessor system should provide.

The available operations include atomic OR and AND, which you can use to set and clear bits. So you might do:

#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

typedef uint64_t cell_t;

#define CELL_SIZE (64)

/* Pointer to the 64-bit cell that holds bit bitIdx. */
static inline _Atomic cell_t *get_cell(size_t bitIdx, _Atomic cell_t *arr) {
    return arr + (bitIdx / CELL_SIZE);
}

/* Single-bit mask within a cell; bit 0 is the most significant bit,
   matching the original macros. */
static inline cell_t get_mask(size_t bitIdx) {
    return (1uL << (CELL_SIZE - 1)) >> (bitIdx % CELL_SIZE);
}

static inline cell_t getBit(size_t bitIdx, _Atomic cell_t *arr) {
    return atomic_load(get_cell(bitIdx, arr)) & get_mask(bitIdx);
}

static inline void resetBit(size_t bitIdx, _Atomic cell_t *arr) {
    atomic_fetch_and(get_cell(bitIdx, arr), ~get_mask(bitIdx));
}

static inline void setBit(size_t bitIdx, _Atomic cell_t *arr) {
    atomic_fetch_or(get_cell(bitIdx, arr), get_mask(bitIdx));
}

Try on godbolt

I took the liberty of replacing your macros with inline functions, which are preferable in practically all situations.

You might be able to speed things up somewhat, on some platforms, by using the *_explicit versions with memory_order_relaxed, but it depends greatly on how the functions will be used, and an introduction to memory ordering is outside the scope of this post.
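
For illustration, relaxed variants could look like the sketch below (the *_relaxed function names are mine, reusing get_cell and get_mask from above); whether relaxed ordering is actually safe depends on what other synchronization your program has.

static inline void setBit_relaxed(size_t bitIdx, _Atomic cell_t *arr) {
    /* Still an atomic RMW, but imposes no ordering on surrounding accesses. */
    atomic_fetch_or_explicit(get_cell(bitIdx, arr), get_mask(bitIdx),
                             memory_order_relaxed);
}

static inline void resetBit_relaxed(size_t bitIdx, _Atomic cell_t *arr) {
    atomic_fetch_and_explicit(get_cell(bitIdx, arr), ~get_mask(bitIdx),
                              memory_order_relaxed);
}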

The performance implications are harder to assess. Atomic read-modify-write operations are usually a lot slower than non-atomic ones. And if there is contention between CPUs for cache lines, that slows things down further. Unless you can control these aspects really well, your plan of putting more cores to work on the task will probably make it slower.

As to your last question about whether to test the bit first, hard to say. A load is generally faster than a read-modify-write. However, the cost of doing the test and conditional branch, and especially the possibility of mispredicting that branch, may nullify the savings.
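
If you do want to experiment with the check-first approach, it can be written with a plain atomic load in front of the RMW, for example (a sketch only, with a helper name of my choosing; only benchmarking will tell whether it wins):

static inline void resetBitIfSet(size_t bitIdx, _Atomic cell_t *arr) {
    /* Cheap load first; only pay for the atomic RMW if the bit still appears set. */
    if (atomic_load(get_cell(bitIdx, arr)) & get_mask(bitIdx))
        atomic_fetch_and(get_cell(bitIdx, arr), ~get_mask(bitIdx));
}

Note that another thread may clear the bit between the load and the RMW; the atomic_fetch_and keeps the result correct either way.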

Nate Eldredge
  • I already wrote macros which are almost copy / paste of your code (at least from an approach POV). I calculated cell and mask, and then used them in the `atomic_fetch_and()`. However, I thought I was doing an "optimization" by taking the calculation of cell and mask outside (ahead) of the "atomic" part. I.e. I passed to `atomic_fetch_and()` a pointer and a value, already calculated. Is there a reason to include all those calculations in the "atom"? – virolino Feb 20 '22 at 03:04
  • @virolino: That's not how C works. An expression as a function arg is computed strictly before the function is called. So your version would compile to the same asm as Nate's; check https://godbolt.org/z/v147a8x5d (or the link in the answer): for x86-64 it compiles to a bunch of work ahead of a single `lock or QWORD PTR [rsi+rax*8], rdx` instruction. (There is in fact no other sane way for it to compile, unless the compiler was going to invent a CAS retry loop, like it would have to if the return value was used.) – Peter Cordes Feb 20 '22 at 03:21
  • @Nate: You forgot return types for the last 3 functions in the code block in the question; I guess you forgot to copy that back from Godbolt. – Peter Cordes Feb 20 '22 at 03:22
  • @virolino: Actually, I guess it could compile to `lock bts [rsi], rdi` if the bit-indexing within a qword wasn't done backwards (e.g. `1UL< – Peter Cordes Feb 20 '22 at 03:30
  • Using 64-bit "cells" increases contention between multiple bits in the *same* thread, which could otherwise potentially have some ILP. Maybe not on x86 because of the `lock` prefix semantics, but bits in adjacent bytes can be independently modified. This is inconvenient for C if you also want to be able to scan through them efficiently, although 8 bytes at a time is still crap vs. AVX2, but still 8x better than 1 byte at a time, and without C++20 `atomic_ref` an object has to be `_Atomic` for the whole life of the program, unless you manually use compiler builtins like `__atomic_fetch_or` (see the sketch after these comments).
  • Also, `uint64_t` is very bad on some 32-bit ISAs, being wider than a register forcing less efficient atomic RMW. 4-byte chunks might be ok, or just `int`. Or for x86, `char` is great. Maybe also worth trying for ARM / AArch64. – Peter Cordes Feb 20 '22 at 03:36
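
To make those last two comments concrete, here is a sketch of byte-sized cells using the GCC/Clang __atomic builtins, which operate on plain (non-_Atomic) objects. This is compiler-specific rather than standard C11, and the function names are mine:

#include <stddef.h>

static inline void setBitByte(size_t bitIdx, unsigned char *arr) {
    /* MSB-first within each byte, mirroring the 64-bit masks above. */
    __atomic_fetch_or(&arr[bitIdx / 8], (unsigned char)(0x80u >> (bitIdx % 8)),
                      __ATOMIC_SEQ_CST);
}

static inline void resetBitByte(size_t bitIdx, unsigned char *arr) {
    __atomic_fetch_and(&arr[bitIdx / 8], (unsigned char)~(0x80u >> (bitIdx % 8)),
                       __ATOMIC_SEQ_CST);
}

__ATOMIC_RELAXED could be substituted for __ATOMIC_SEQ_CST, under the same caveats as the memory_order_relaxed discussion in the answer.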