I'm running the code below in a Linux kernel module on x86_64. Basically, I iterate over 256 bits, and for each bit that is set to 1, I clear it and perform a certain action. The code below, however, takes a few thousand cycles to run (and the bottleneck is not the code executed within the body of the if statement).
unsigned long *data = ...;
for (int i = 0; i < 256; i++) {
    // test_and_clear_bit is a function defined in the Linux kernel
    if (test_and_clear_bit(i, data)) {
        // bit i was set, so do something
    }
}
The bottleneck seems to be the test_and_clear_bit function. The data I'm iterating over is a hardware-defined data structure that I may only mutate with read-modify-write instructions (according to the Intel manual). This is because the processor may attempt to mutate the data structure at the same time. Thus, I can't fall back on a simple solution such as protecting the data structure with a single spinlock and then just reading and clearing the bits with non-atomic instructions.
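To make the access pattern easier to reproduce outside the kernel, here is a minimal userspace sketch of the same loop. It is only a model: model_test_and_clear_bit and scan_bits are hypothetical names, and C11 atomic_fetch_and stands in for the kernel's test_and_clear_bit (on x86_64, compilers typically lower it to a lock-prefixed instruction). The point it illustrates is that the loop issues one atomic read-modify-write per bit, 256 in total, whether or not the bit is set.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NBITS 256
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define NWORDS (NBITS / BITS_PER_WORD)

static_assert(NBITS % BITS_PER_WORD == 0, "bitmap must be a whole number of words");

/* Hypothetical userspace stand-in for the kernel's test_and_clear_bit():
 * atomically clears bit nr and reports whether it was previously set. */
static bool model_test_and_clear_bit(unsigned int nr, _Atomic unsigned long *addr)
{
    unsigned long mask = 1UL << (nr % BITS_PER_WORD);
    unsigned long old = atomic_fetch_and(&addr[nr / BITS_PER_WORD], ~mask);

    return (old & mask) != 0;
}

/* Same shape as the loop in the question: one atomic RMW per bit.
 * Returns the number of bits that were found set and cleared. */
static unsigned int scan_bits(_Atomic unsigned long *data)
{
    unsigned int handled = 0;

    for (unsigned int i = 0; i < NBITS; i++) {
        if (model_test_and_clear_bit(i, data))
            handled++; /* placeholder for the real per-bit action */
    }
    return handled;
}
```

Calling scan_bits on an array of NWORDS atomic words with a few bits set clears those bits and returns how many it found, so it can be wrapped in a timing loop to compare against the kernel numbers.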
Is there a faster way to do this?