0

I want to do bitTestAndSet on a tbb atomic variable.

atomic.h from tbb does not seem to have any bit operations.

If I treat the tbb atomic variable as a normal pointer and do __sync_or_and_fetch gcc compiler doesn't allow that.

Is there a workaround for this?

Related question:

assembly intrinsic for bit test and set (BTS)

Community
  • 1
  • 1
arunmoezhi
  • 3,082
  • 6
  • 35
  • 54

1 Answers1

2

A compare_and_swap loop can be used, like this:

// Atomically perform i|=j. Return previous value of i.
int bitTestAndSet( tbb::atomic<int>& i, int j ) {
    int o = i;                  // Atomic read (o = "old value")
    while( (o|j)!=o ) {         // Loop exits if another thread sets the bits
        int k = o;
        o = i.compare_and_swap(k|j,k);
        if( o==k ) break;       // Successful swap
    }
    return o;
}

Note that if the while condition succeeds on the first try, there will be only an acquire fence, not a full fence. Whether that matters depends on context.

If there is risk of high contention, then some sort of backoff scheme should be be used in the loop. TBB uses a class atomic_backoff for contention management internally, but it's not currently part of the public TBB API.

There is a second way, if portability is not a concern and you are willing to exploit the undocumented fact that the layout of a tbb::atomic and T are the same on x86 platforms. In that case, just operate on the tbb::atomic using assembly code. The program below demonstrates this technique:

#include <tbb/tbb.h>
#include <cstdio>

inline int SetBit(int array[], int bit) {
    int x=1, y=0;
    asm("bts %2,%0\ncmovc %3,%1" : "+m" (*array), "+r"(y) : "r" (bit), "r"(x));
    return y;
}

tbb::atomic<int> Flags;
volatile int Result;

int main() {
    for( int i=0; i<16; ++i ) {
        int k = i*i%32;
        std::printf("bit at %2d was %d.  Flags=%8x\n", k, SetBit((int*)&Flags,k), +Flags);
    }
}
Arch D. Robison
  • 3,829
  • 2
  • 16
  • 26
  • Thanks for the answer. I do not want the bit set to fail. If I use CAS it can fail to set the bit. I want to blindly set the bit – arunmoezhi Apr 10 '14 at 17:12
  • What difference does it if the bit set fails and is retried? That's the same as if the thread arrived a little bit later. Not all hardware has an atomic bit-set. In fact, if the result of sync_or_and_fetch is used, gcc generates a cmpxchg loop. I'm assuming the result is of interest since the question title says "test and set", not just "set". – Arch D. Robison Apr 10 '14 at 21:19
  • if the machine has a hardware BTS instruction, then will it still use `cmpxchg`? – arunmoezhi Apr 12 '14 at 02:57
  • Yes, the idiom that I showed will still use cmpxchg, regardless of the target platform. – Arch D. Robison Apr 14 '14 at 14:15
  • are you saying `it will still use cmpxchg` since your code uses CAS to implement BTS? Can the hardware BTS instruction be used on a tbb variable? – arunmoezhi Apr 25 '14 at 05:20
  • 1
    The layout of a tbb::atomic is identical to the layout of a T (i.e. we don't hide any extra mechanisms in the object.) So if you wrote your own asm with BTS, and used reinterpret cast, you could apply it to the tbb::atomic. – Arch D. Robison Apr 25 '14 at 14:41
  • so are you saying `tbb:atomic` identifier on a variable is something like `volatile` (but their behaviours are different) and any operation done on that variable without this identifier can be done using this identifier? – arunmoezhi Apr 25 '14 at 18:56
  • Layout refers to the mapping of the structure to the hardware memory. Given `tbb::atomic a;`, use `reinterpret_cast&a` to get a pointer to the raw representation, and apply an asm/intrinsic for BTS to the target of that pointer. – Arch D. Robison Apr 28 '14 at 17:40
  • Too complicated but thanks. I will look into that. Will doing this give better performance than doing BTS using CAS – arunmoezhi Apr 28 '14 at 17:50
  • 1
    BTS would be theoretical faster, because the hardware arbitration logic will will with contenting operations. Whereas in the cmpxchg loop may require retries. But the difference would be significant only under high contention, and under high contention the program will likely perform poorly anyway. – Arch D. Robison Apr 30 '14 at 19:23
  • Can you post how to use BTS with tbb. There is an answer in http://stackoverflow.com/questions/1983303/using-bts-assembly-instruction-with-gcc-compiler, but that doesn't use tbb – arunmoezhi May 03 '14 at 22:03
  • wow..Thanks. Let me add this snippet to my program. I'm also interested in comparing the performance of different types of BTS implementations.. I will do that and share the results with you if you are interested – arunmoezhi May 06 '14 at 18:44
  • in `asm("bts %2,%0\ncmovc %3,%1" : "+m" (*array), "+r"(y) : "r" (bit), "r"(x));`, is the conditional move supported in Xeon Phi? – arunmoezhi Jan 22 '16 at 05:04
  • `cmov` is a worse choice than `setcc` for this, because it requires setting up registers holding 0 and 1. See the `xor / lock bts / setc` inline asm I wrote for the OP's related question: http://stackoverflow.com/questions/34940356/atomic-test-and-set-in-x86-inline-asm-or-compiler-generated-lock-bts. Also, if you aren't using `lock bts`, you might as well use a `"+rm"` constraint, in case the compiler wants to load it before the asm statement, and keep it live in a register. load/bts/store is faster anyway, because of `bts`'s crazy CISC semantics, esp. with a non-immediate bit position. – Peter Cordes Jan 22 '16 at 09:13