
Inspired by a recent question, I'd like to know if anyone knows how to get gcc to generate the x86-64 bts instruction (bit test and set) on Linux x86-64 platforms, without resorting to inline assembly or to nonstandard compiler intrinsics.

Related questions:

Portability is more important to me than bts, so I won't use an asm directive, and if there's another solution, I'd prefer not to use compiler intrinsics.

EDIT: The C source language does not support atomic operations, so I'm not particularly interested in getting atomic test-and-set (even though that's the original reason for test-and-set to exist in the first place). If I want something atomic I know I have no chance of doing it with standard C source: it has to be an intrinsic, a library function, or inline assembly. (I have implemented atomic operations in compilers that support multiple threads.)
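
For concreteness, here is the sort of plain (non-atomic) C I have in mind; the helper names and the choice of `unsigned` words are just illustrative, and the hope is simply that the optimizer recognizes the shift-and-mask pattern:

    #include <limits.h>

    /* Portable bit set/test on an array of unsigned words. Ideally the
       compiler would turn these shift-and-mask patterns into bts/bt. */
    #define WORD_BITS (sizeof(unsigned) * CHAR_BIT)

    static inline void set_bit(unsigned *array, unsigned bit)
    {
        array[bit / WORD_BITS] |= 1u << (bit % WORD_BITS);
    }

    static inline int test_bit(const unsigned *array, unsigned bit)
    {
        return (array[bit / WORD_BITS] >> (bit % WORD_BITS)) & 1u;
    }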

Norman Ramsey
  • Hmm, good question. I see `vtst_*` for vector bit-test on ARM+NEON, but nothing more general... – ephemient Jan 11 '10 at 05:06
  • If bts is really faster, send a bug report. I am sure gcc programmers are already aware of bts' existence. After all, a compiler isn't supposed to map 1:1. –  Jan 11 '10 at 05:45
  • Better still, send a patch to use bts together with testcases that can be profiled to prove that the optimization is worthwhile. – Stephen C Mar 06 '10 at 05:30
  • @jbcreix: The `gcc` bug report you're asking for has been filed (and fixed for 4.3.0), two years before this SO posting, see: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36473 – FrankH. Mar 12 '13 at 09:28

3 Answers


As the first answer to the first linked question points out, the real question is how much this matters in the grand scheme of things. The only places where you test bits are:

  • Low-level drivers. However, if you are writing one you probably know assembly, the code is already sufficiently tied to the system, and most of the delays are in I/O anyway.
  • Testing for flags. This usually happens either at initialisation (once, at the beginning) or around some shared computation (which takes much more time).

The overall impact on the performance of applications and macrobenchmarks is likely to be minimal even if microbenchmarks show an improvement.

Regarding the edit: using bts alone does not guarantee that the operation is atomic. All it guarantees is that it is atomic on the current core (which is also true of an or done on memory). On multi-processor systems (uncommon) or multi-core systems (very common) you still have to synchronize with the other processors.

As synchronization is much more expensive, I believe that the difference between:

asm("lock bts %0, %1" : "+m" (*array) : "r" (bit));

and

asm("lock or %0, %1" : "+m" (*array) : "r" (1 << bit));

is minimal. And the second form:

  • Can set several flags at once
  • Has the convenient __sync_fetch_and_or (array, 1 << bit) form, which works on gcc and the Intel compiler as far as I remember (see the sketch after this list).
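
A minimal sketch of that builtin form, just for illustration (the wrapper name is made up here, and I assume array points at a single unsigned word and bit is smaller than the word width):

    /* Atomically OR the mask into *array and return the previous value;
       the returned value tells you whether the bit was already set. */
    static inline unsigned atomic_set_bit(unsigned *array, unsigned bit)
    {
        return __sync_fetch_and_or(array, 1u << bit);
    }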
Maciej Piechotka
  • "How much does it matter": well, depending on the CPU type, `bts` is 20% faster. See the `gcc` bug report in the comments to the question. – FrankH. Mar 12 '13 at 09:29
  • @FrankH.: I've clarified. `bts` is 20% faster in a microbenchmark, but if it does not improve overall performance there is no reason to make the compiler more complicated just to improve a microbenchmark. Apparently they found a use for it - but compilers are more often driven by macro- than micro-benchmarks. – Maciej Piechotka Mar 12 '13 at 10:59
  • @MaciejPiechotka: For various types of software (e.g. "propositional logic engines" that work on large arrays of bits), 20% faster in micro-benchmarks can translate to maybe 19% faster in real world software. One of the cases for `bts` is allocators (e.g. one bit per thing you could allocate/free) where the fact that `bts` and `btr` can be atomic (with a `lock` prefix) means you can end up with block-free algorithms in multi-threaded code. Also note that these instructions happily work on arrays of 2 billion bits or more (in memory) with no additional "scaffolding". – Brendan Jul 01 '15 at 05:47
  • @Brendan yeah, sure - the question was whether the additional speed for a domain-specific application is good enough to justify extending the - already complicated - optimizer (apparently it was). My guess is that before 4.3 there were much lower-hanging fruits for the probably overworked gcc devs to pick - I'm sure they have a long TODO list and need to prioritize. Besides, such engines would be "hand optimized" anyway - and for atomic operations you have `__sync_fetch_and_or` and, in 2015, the whole of C11 atomics (OK - AFAIK it's implemented only for C++11, but let's not get into such details). – Maciej Piechotka Jul 01 '15 at 08:49
  • `bts` with a memory operand can index outside the dword addressed by `*array`, so that's not safe. You have to tell gcc that the whole array is the memory operand, not just one element. Also, you don't return the old value, so it's not `fetch_or` or test-and-set, it's just an atomic `set`. Moreover, the OP was asking for non-atomic `bts reg,reg` as a performance optimization. `bts mem,reg` with a memory operand is slower than load / `bts r,r` / store because of the bitstring semantics. (on Sandybridge-family for example. http://agner.org/optimize/) – Peter Cordes Jun 13 '18 at 09:24

I use the gcc atomic builtins such as `__sync_lock_test_and_set` (http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html). Changing the -march flag will directly affect what is generated. I'm using it with i686 right now, but http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/i386-and-x86_002d64-Options.html#i386-and-x86_002d64-Options shows all the possibilities.

I realize it's not exactly what you are asking for, but I found those two web pages very useful when I was looking for mechanisms like that.
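
For example, here is a rough sketch of a simple test-and-set flag built on that builtin (the names flag, try_lock and unlock are only illustrative):

    /* __sync_lock_test_and_set atomically stores the new value and returns
       the previous one (an acquire barrier); __sync_lock_release stores 0
       with release semantics. */
    static volatile int flag;

    static int try_lock(void)
    {
        return __sync_lock_test_and_set(&flag, 1) == 0;  /* nonzero on success */
    }

    static void unlock(void)
    {
        __sync_lock_release(&flag);
    }

On x86-64 the test-and-set typically compiles to an xchg instruction.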

laura
  • With gcc 4.2 on x86-64, `__sync_lock_test_and_set` generates an `xchg`, not `bts`. – kennytm Jan 11 '10 at 10:08
  • `bts` sets a bit, whereas the one I mentioned sets a whole variable. I linked to the page in the hope that the OP might find something useful there. – laura Jan 11 '10 at 10:20

I believe (but am not certain) that neither the C++ nor the C standard has any mechanism for this type of synchronization primitive yet. Support for higher-level synchronization mechanisms is in various states of standardization, but I don't think even those would give you access to the kind of primitive you're after.

Are you programming lock-free data structures where locks are insufficient?

You probably want to just go ahead and use gcc's non-standard extensions and/or the synchronization primitives provided by your operating system or a library. If you're concerned about using compiler intrinsics, I would bet there's a library out there that provides the kind of portability you're looking for. (Though really, I think most people just bite the bullet and use gcc-specific code when they need it. Not ideal, but the standards haven't really been keeping up.)
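
For instance, here is just a sketch of the kind of thing those gcc extensions let you write: an atomic test-and-set of a single bit built on the legacy __sync compare-and-swap builtin (the function name is made up for illustration):

    /* Atomically set `bit` in *word; returns nonzero if it was already set. */
    static int test_and_set_bit(unsigned *word, unsigned bit)
    {
        unsigned mask = 1u << bit;
        unsigned old;
        do {
            old = *word;
            if (old & mask)
                return 1;   /* bit was already set, nothing to do */
        } while (!__sync_bool_compare_and_swap(word, old, old | mask));
        return 0;           /* this call set the bit */
    }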

DJ Capelis
  • 7
    OP isn't asking for synchronization methods. OP is asking if there is a portable way to hint to the compiler to use `bts` instead of `shl` + `or`, as the former is faster. – ephemient Jan 11 '10 at 18:50