2

I'm running the code below in a Linux kernel module on x86_64. Basically, I iterate over 256 bits, and for each bit that is set to 1, I clear it and perform a certain action. The code below, however, takes a few thousand cycles to run (and the bottleneck is not the code executed within the body of the if statement).

unsigned long *data = ...;
for (int i = 0; i < 256; i++) {
    //test_and_clear_bit is a function defined in the Linux kernel
    if (test_and_clear_bit(i, data)) {
        //bit i was set, so do something
    }
}

The bottleneck seems to be the test_and_clear_bit function. The data I'm iterating over is a hardware-defined data structure that I may only mutate with read-modify-write instructions (according to the Intel manual). This is because the processor may attempt to mutate the data structure at the same time. Thus, I can't fall back on a simple solution such as protecting the data structure with a single spinlock and then just reading and clearing the bits with non-atomic instructions.

Is there a faster way to do this?

Jack Humphries
  • you can try with volatile and your own test&clear implementation – Iłya Bursov Mar 30 '18 at 03:49
  • 1
    You can test 8, 16, 32 or 64 bits at a time, and then only test the individual bits if you don't get 0 for the group of bits. – user253751 Mar 30 '18 at 03:53
  • 3
    I agree with immibis, `atomic_exchange_n(data, 0U)` will test-and-clear an entire group at once, you can then use ordinary register accesses to walk through your local copy of that group (which was created atomically with the clear, meeting the hardware requirements) You can possibly even find the bits in that localized copy in O(set bits) instead of O(total bits) using the count-leading-zeros intrinsic – Ben Voigt Mar 30 '18 at 05:39
  • What is the data structure? – prl Mar 30 '18 at 05:47
  • @prl It's the posted interrupt descriptor. I got posted interrupts working perfectly. However, when a posted interrupt is received on a core that is in root mode (i.e. not in non-root mode), I need to inject the interrupt into non-root mode. That's what I'm trying to do here - read and clear the vector(s) from the posted interrupt descriptor (I greatly simplified my implementation above, e.g. ignored outstanding bit, etc.). Given that I don't need to inject very often, this is only a small performance boost, but I'm trying to optimize my implementation as much as possible. – Jack Humphries Mar 30 '18 at 06:39
  • Thanks everyone. These suggestions are exactly what I was looking for. I will try to implement the suggestions tomorrow and will let you know. – Jack Humphries Mar 30 '18 at 06:41
  • @BenVoigt Hey Ben - I ended up using the function you suggested to copy 64 bits at a time into a local buffer (so I call the function 256 / 64 = 4 times), and then I iterate over all of the bits in the local buffer. My implementation is now a few thousand cycles faster. Thanks for your help! – Jack Humphries Mar 30 '18 at 18:45

2 Answers

2

This is a difficult question to answer because we don't know exactly what this data is, and because of this statement:

The data I'm iterating over is a hardware-defined data structure that I may only mutate with read-modify-write instructions (according to the Intel manual).

That said, the best we can do is offer general ideas/recommendations, which may or may not apply to your specific situation. You can try the following:

  • Copy data into a local buffer and iterate over its bits, calling test_and_clear_bit only for bits that were set in the local copy. This avoids issuing the atomic read-modify-write for bits that are already clear. Bits that were clear in the local copy might have been set between the copy and the per-bit pass, so those will be missed on this iteration; if that is an acceptable loss, this will likely speed up the loop significantly.
  • Test multiple bits at a time, if possible. As @immibis mentions in a comment, if you can check 8, 16, 32, or 64 bits at a time, you only need to test individual bits when the whole group reads nonzero. If it's likely that at least one bit is set in every group of 8 or more, this won't help and will actually slow down the loop by adding an extra, unnecessary check.
  • Attempt your own test_and_clear_bit implementation with volatile, as @IłyaBursov mentions in a comment. This is not guaranteed to work, and what works on one platform or compiler might not on another. However, since you're dealing with a hardware-defined memory structure, a platform-specific solution might be acceptable. Note that volatile is unlikely to guard against the processor modifying the bits concurrently, though on some platforms (if you are lucky, yours) it might. As mentioned here:

    As a result, most implementations do not insert sufficient memory fences to guarantee that other threads, or even hardware devices, see volatile operations in the order in which they were issued

    On some platforms, some limited ordering guarantees are provided, either because they are automatically enforced by the underlying hardware or, as on Itanium, because different instructions are generated for volatile references. But the specific rules are highly platform dependent. And even when they are specified for a specific platform, they may be inconsistently implemented.

cegfault
  • Thanks! These suggestions were very helpful. The data structure is the posted interrupt descriptor (for Intel VT-x). Not sure if you're familiar with VT-x or posted interrupts, but basically I need to deliver interrupts to the guest operating system that are received when the guest OS is not running at the moment on the destination core. I read the pending interrupts from the descriptor. Anyway, I will try your suggestions and let you know how it goes. Thanks. – Jack Humphries Mar 30 '18 at 06:45
  • 1
    @JackHumphries If the intended application is storing guest interrupts when the guest OS is not running, then this means copying the data structure altogether should be acceptable. Manipulating the real data directly is only needed in time-sensitive situations; making a copy and then reading (without any type of lock) should be many times faster, and it's not time sensitive because you're queuing the interrupts until the guest OS spins up again – cegfault Mar 30 '18 at 08:58
  • 2
    Thanks, your second suggestion worked perfectly. As Ben Voight suggested in a comment on the question, I used `__atomic_exchange_n` to atomically copy 64 bits at a time out of the posted interrupt descriptor into a local buffer (using `__ATOMIC_RELAXED` as the memory order), and then I iterate over the local buffer. My implementation is now a few thousand cycles faster :) – Jack Humphries Mar 30 '18 at 18:41
  • The problem with `volatile` for this is not the lack of memory-ordering, it's the [lack of atomicity for the read-modify-write operation](https://stackoverflow.com/questions/39393850/can-num-be-atomic-for-int-num)! Using `volatile` won't get the compiler to use `lock btr` or `lock and`, or `xchg`. If no other core can currently be modifying the data in this buffer while this code runs, all you need is regular accesses, maybe with some kind of memory barrier to make sure all operations are done (and not reordered at compile time) before doing whatever allows other threads to resume. – Peter Cordes Mar 30 '18 at 18:41
2

Copy and clear all the data into a local buffer using an atomic exchange (or an atomic fetch-and-AND with 0); then work on the local copy. This should work just as well as your code, since every bit you clear will still be processed, without risking ignoring or overwriting bits that are set "in the meanwhile".

I don't know about the Linux kernel primitives, but with gcc atomic builtins it would be something like:

const int bpl = 8*sizeof(unsigned long);
const int len = (256+bpl-1)/bpl;
unsigned long ldata[len];
for(int i = 0; i < len; ++i) {
    ldata[i] = __atomic_exchange_n(&data[i], 0, __ATOMIC_ACQ_REL);
}
for(unsigned i = 0; i < 256; ++i) {
    if(ldata[i/bpl] & (1UL<<(i%bpl))) { // 1UL: shifting a plain int by >=32 is undefined
        // do your stuff
    }
}
Matteo Italia
  • Thanks for this helpful suggestion! I will try tomorrow and let you know. – Jack Humphries Mar 30 '18 at 06:45
  • 2
    I think you're likely to get better asm with an `exchange` rather than `and` with zero, because x86 doesn't have a fetch-and instruction. The only atomic fetch-op instructions that actually load the old value are `lock xadd` and `xchg`, with `xchg` being kind of a special case (the `op` part being a no-op). So writing it with `fetch_and` would require the compiler to see the special case that `and` with 0 is unconditional and the desired new value doesn't depend on the old value after all. If it misses that, you'll get a load + cmpxchg-retry-loop. – Peter Cordes Mar 30 '18 at 18:34
  • @PeterCordes: that was my first idea (and what I would have written if doing it in assembly), but looking at the gcc atomic intrinsics I didn't find it, so I went for the next easiest thing. – Matteo Italia Mar 30 '18 at 18:59
  • Hmm, interesting, the legacy `__sync` builtins don't seem to have a pure exchange. The now-recommended [`__atomic` builtins](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html#g_t_005f_005fatomic-Builtins) (that are aware of C11-style memory ordering semantics) include a `type __atomic_exchange_n (type *ptr, type val, int memorder)`. So `ldata[i] = __atomic_exchange_n(&data[i], 0, __ATOMIC_ACQ_REL)` should be good. – Peter Cordes Mar 30 '18 at 19:08