10

Consider a bit vector of N bits (N is large) and an array of M numbers (M is moderate, usually much smaller than N), each in the range 0..N-1, indicating which bit of the vector must be set to 1. The latter array is not sorted. The bit vector is just an array of integers, specifically __m256i, where 256 bits are packed into each __m256i structure.

How can this work be split efficiently across multiple threads?

Preferred language is C++ (MSVC++2017 toolset v141); assembly is also great. Preferred CPU is x86_64 (intrinsics are ok). AVX2 is desired, if there is any benefit from it.
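
For reference, a minimal single-threaded baseline of the operation described above might look like the following sketch (the byte-wise view of the __m256i array and the names are illustrative, not part of the question):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Sketch of the baseline: N bits stored in an array of __m256i, one bit set per index in M.
// Viewing the storage as bytes keeps the per-index work to a shift, an index and an OR.
void set_bits_single_threaded(__m256i* vec, const uint32_t* M, size_t m)
{
    uint8_t* bytes = reinterpret_cast<uint8_t*>(vec);
    for (size_t i = 0; i < m; ++i) {
        uint32_t idx = M[i];
        bytes[idx >> 3] |= uint8_t(1u << (idx & 7));
    }
}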

Peter Cordes
Serge Rogatch
  • 2
    Hm... seems like a problem in memory bandwidth mostly. I'm not sure if there is really a better way than just doing it the obvious way. One approach might be to sort the array first so you can set the bits in order, making the cache much more efficient. – fuz Aug 07 '17 at 21:48
  • @fuz, yes, the problem is in memory contention. Particularly, I wonder whether it's faster to OR one byte at a time or a 64-bit word at a time. The first option may reduce contention (if the CPU locks just 1 byte, rather than the whole word that the byte belongs to). The second option reduces the total number of memory accesses and squeezes as much as possible from each access (because the RAM bus is 64 bits wide). – Serge Rogatch Aug 07 '17 at 21:53
  • 1
    Is `M` already sorted? If not, you would almost certainly want to optimize for a single thread. – zzxyz Aug 07 '17 at 22:04
  • 1
    Measure the performance with typical data with a few algorithms... Show us your code. By a bit vector, do you mean a `std::bitset` or a `std::vector` or something else. See also: [How can std::bitset be faster than std::vector?](https://stackoverflow.com/questions/4156538/how-can-stdbitset-be-faster-than-stdvectorbool). If your data is not already sorted and very large, it would be hard to optimize. **Also avoid premature optimization**. Only if you can prove that the obvious way is not enough. For small data size, overhead of thread or complex algorithm will make the code slower. – Phil1970 Aug 08 '17 at 00:27
  • 1
    On x86, lock or is going to lock an entire cache line, so you won't get any benefit from working with bytes instead of qwords. – prl Aug 08 '17 at 03:28
  • 1
    If the array is not sorted, consider using bts. Then you won't have to do any memory address arithmetic or bit shifting; just use the bit number directly. – prl Aug 08 '17 at 03:32
  • If M << N / 512, there won't be significant contention if the elements in the array are random. (512 is the number of bits in a cache line.) – prl Aug 08 '17 at 04:16
  • @prl, the array of positions is not sorted. What is "bts"? – Serge Rogatch Aug 08 '17 at 09:16
  • 1
    I agree with @prl that if M much smaller than N then contention will be low. But it could be reduced further if the updates are sorted and composed into a mask `m` that is 'ored' into place `w = w | m;` and instead (hard core) if T threads index by t [0...T-1] are each given cache lines such that addr % T == t. So each thread looks for updates in lines it 'owns'. I'm not saying either will improve performance because there's an overhead and M << N but I'm just pointing out at scale you want to cut up the target space (N) not the domain space (M). Alignment or over alignment may help also. – Persixty Aug 08 '17 at 09:56
  • BTS is bit test and set. (You don't care about the "test" part.) With a lock prefix, it allows atomically setting a single bit. Also it allows using the base address of the bit vector as the address in the instruction, and the bit number can be as large as you like. It automatically determines the byte to modify. – prl Aug 09 '17 at 04:38
  • Are the values in `M` fairly smoothly distributed? In particular, if you partition the vector of bits into `T` partitions (each with `N / T` bits), is likely that each partition will have _roughly_ the same number of "set" bits as implied by the values in `M`? – BeeOnRope Aug 21 '17 at 20:33
  • @BeeOnRope, I would like to see the solutions for both "yes" and "no" separately. If this property helps a better solution, let's consider it's true. – Serge Rogatch Aug 21 '17 at 20:46
  • @SergeRogatch - I added my solution below. The simple version works well with a smooth `M` and an easy refinement works with "kind of smooth M" (i.e., for distributions that are well distributed at a high level), and then finally you get handle any type of distribution of `M` with a fancier partitioning step. – BeeOnRope Aug 21 '17 at 20:55
  • @prl: `bts [mem], reg` is slower than doing the address math yourself and generating the bit-mask yourself (e.g. with `xor eax,eax` / `bts eax, reg`, because `bts r32,r32` is fast on Intel CPUs at least.) This is true for normal or `lock bts` vs. `lock or`. https://stackoverflow.com/questions/45556086/how-to-set-bits-of-a-bit-vector-efficiently-in-parallel/45805344?noredirect=1#comment78743792_45805344. (See also http://agner.org/optimize/. On both Ryzen and Skylake, non-`lock`ed `bts [mem], reg` has one per 5 cycle throughput, which is horrible vs. one per maybe 2c for `or` + ALU overhead) – Peter Cordes Aug 26 '17 at 16:20

3 Answers

2

Let's assume you want to divide this work up among T threads. It's a pretty interesting problem since it isn't trivially parallelizable via partitioning and various solutions may apply for different sizes of N and M.

Fully Concurrent Baseline

You could simply divide the array M into T partitions and have each thread work on its own partition of M with a shared N. The main problem is that, since M is not sorted, all threads may access any element of N and hence stomp on each other's work. To avoid this, you'd have to use atomic operations such as std::atomic::fetch_or for each modification of the shared N array, or else come up with some locking scheme. Both approaches are likely to kill performance (i.e., using an atomic operation to set a bit is likely to be an order of magnitude slower than the equivalent single-threaded code).
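
A minimal sketch of this baseline (assuming, for illustration only, that the shared bit vector is viewed as an array of std::atomic<uint8_t>; the names are mine):

#include <atomic>
#include <cstddef>
#include <cstdint>

// Fully concurrent baseline: every update is an atomic OR into the shared vector,
// so each thread can safely process an arbitrary slice of M.
void set_bits_atomic(std::atomic<uint8_t>* N_bytes, const uint32_t* M,
                     size_t begin, size_t end)
{
    for (size_t i = begin; i < end; ++i) {
        uint32_t idx = M[i];
        N_bytes[idx >> 3].fetch_or(uint8_t(1u << (idx & 7)), std::memory_order_relaxed);
    }
}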

Let's look at ideas that are likely faster.

Private N

One relatively obvious way to avoid the "shared N" problem, which requires atomic operations for all mutations of N, is simply to give each of the T threads a private copy of N and merge them at the end via or.

Unfortunately, this solution is O(N) + O(M/T), whereas the original single-threaded solution is O(M) and the "atomic" solution above is something like O(M/T)[4]. Since we know that N >> M, this is likely to be a poor tradeoff in this case. Still, it's worth noting that the hidden constants in each term are very different: the O(N) term, which comes from the merging step[0], can use 256-bit wide vpor instructions, meaning a throughput of something close to 200-500 bits/cycle (if cached), while the O(M/T) bit-setting step I estimate at closer to 1 bit/cycle. So this approach can certainly be the best one for moderate T, even if the size of N is 10 or 100 times the size of M.
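
Here is a sketch of that merge step, assuming each thread has already filled its own private byte array of length N/8 (names are illustrative, and nbytes is assumed to be a multiple of 32 to keep it short); the wide vpor (_mm256_or_si256) is where the high merge throughput comes from:

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// OR-merge T private copies into the shared result, 256 bits at a time.
void merge_private_copies(uint8_t* result, uint8_t* const* private_copies,
                          size_t T, size_t nbytes)
{
    for (size_t off = 0; off < nbytes; off += 32) {
        __m256i acc = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(result + off));
        for (size_t t = 0; t < T; ++t) {
            __m256i v = _mm256_loadu_si256(
                reinterpret_cast<const __m256i*>(private_copies[t] + off));
            acc = _mm256_or_si256(acc, v);
        }
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(result + off), acc);
    }
}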

Partitions of M

The basic idea here is to partition the indexes in M such that each worker thread can then work on a disjoint part of the N array. If M was sorted, that would be trivial, but it's not, so...

A simple algorithm that will work well if M is smoothly distributed is to first partition the values of M into T buckets, with the buckets having values in the ranges [0, N/T), [N/T, 2N/T), ..., [(T-1)N/T, N). That is, divide N into T disjoint regions and then find the values of M that fall into each of them. You can spread that work across the T threads by assigning each thread an equal-sized chunk of M, having each one create the T partitions, and then logically merging[1] them at the end so you have the T partitions of M.

The second step is to actually set all the bits: you assign one partition to each of the T threads, which can then set its bits in a "single-threaded" way, i.e., without worrying about concurrent updates, since each thread is working on a disjoint partition of N[2].
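
A rough sketch of both steps for the smooth-M case (the bucketing is shown single-threaded for brevity, whereas in practice step one would itself be split across the T threads as described above; regions are chosen on byte boundaries, per footnote [2], so threads never write the same byte; names are illustrative):

#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Step 1: bucket the indices in M by which of T byte-aligned regions of N they fall into.
// Step 2: each thread sets bits only inside its own region, so no atomics are needed.
void set_bits_partitioned(uint8_t* N_bytes, uint64_t n_bytes,
                          const uint32_t* M, size_t m, unsigned T)
{
    std::vector<std::vector<uint32_t>> buckets(T);
    for (size_t i = 0; i < m; ++i) {
        uint64_t byte = M[i] >> 3;                        // partition on byte boundaries
        buckets[size_t(byte * T / n_bytes)].push_back(M[i]);
    }

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < T; ++t) {
        threads.emplace_back([&, t] {
            for (uint32_t idx : buckets[t])
                N_bytes[idx >> 3] |= uint8_t(1u << (idx & 7));  // disjoint bytes per thread
        });
    }
    for (auto& th : threads) th.join();
}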

Both steps are O(M), and the second step is identical to the single-threaded case, so the overhead of parallelizing this is the first step. I suspect the first step will range from about the same speed as the second to perhaps 2-4 times slower, depending on implementation and hardware, so you can expect a speedup on a machine with many cores, but with only 2 or 4 it might not be any better.

If the distribution of M is not smooth, such that the partitions created in the first step have very different sizes, it will work poorly because some threads will get a lot more work. A simple strategy is to create, say, 10 * T partitions rather than only T, and have the threads in the second pass all consume from the same queue of partitions until everything is done. In this way you spread the work more evenly, unless the array M is very bunched up. In that case you might consider a refinement of the first step which first essentially creates a bucketed histogram of the elements, and then a reduce stage which looks at the combined histogram to create a good partitioning.
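
One way to implement that shared queue is simply an atomic ticket counter over the pre-built (byte-disjoint) partitions; a sketch with illustrative names:

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Each worker repeatedly claims the next unprocessed partition until none remain,
// which evens out skew in partition sizes without any per-bit synchronization.
void partition_worker(uint8_t* N_bytes,
                      const std::vector<std::vector<uint32_t>>& partitions,
                      std::atomic<size_t>& next)
{
    for (;;) {
        size_t p = next.fetch_add(1, std::memory_order_relaxed);
        if (p >= partitions.size())
            return;
        for (uint32_t idx : partitions[p])
            N_bytes[idx >> 3] |= uint8_t(1u << (idx & 7));
    }
}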

Essentially, we are just progressively refining the first stage into a type of parallel sort/partitioning algorithm, for which there is already plenty of literature. You might even find that a full (parallel) sort is fastest, since it will greatly help the bit-setting phase: accesses will be in order and have the best spatial locality (helping with prefetching and caching, respectively).


[0] ... and also from the "allocate a private array of length N" step, although this is likely to be quite fast.

[1] The conceptually simplest form of merging would be to simply copy each thread's partitions of M such that you have a contiguous partition of all of M, but in practice if the partitions are large you can just leave the partitions where they are and link them together, adding some complexity to the consuming code, but avoiding the compacting step.

[2] To make it truly disjoint from a threading point of view you want to ensure the partition of N falls on "byte boundaries", and perhaps even cache-line boundaries to avoid false sharing (although the latter is likely not to be a big problem since it only occurs at the edge of each partition, and the order of processing means that you are not likely to get contention).

[4] In practice, the exact "order" of the baseline concurrent solution using shared N is hard to define because there will be contention so the O(M/T) scaling will break down for large enough T. If we assume N is quite large and T is limited to typical hardware concurrency of at most a dozen cores or so it's probably an OK approximation.

BeeOnRope
  • I would say the obvious shared-N baseline is `fetch_or`. CAS-retry would be much worse on x86. `fetch_or` can compile to `lock bts [N], rax` if you don't use the result. Or to a shift or `bts` in a register + `lock or`, which may be faster. (It's definitely faster for the non-`lock` case, because `bts [mem], reg` is many more uops because of the crazy bit-string semantics.) On Haswell, `lock add` is 8 uops, but non-`lock`ed `bts [mem],reg` is 10. `lock bts` is probably several more uops, but the throughput may still be the same. – Peter Cordes Aug 26 '17 at 04:23
  • If we're talking x86 with its unusually wide variety of atomic operations, sure. Are we talking x86? I should add that anyway... @PeterCordes – BeeOnRope Aug 26 '17 at 04:32
  • The OP is definitely talking x86, and you mention AVX2 `vpor`. Also, load + OR + CAS-with-LL/SC is worse than LL/OR/SC, especially if hardware has any special support for hardware arbitration of LL/SC. – Peter Cordes Aug 26 '17 at 04:33
  • Hmm, I thought for a second that AVX512 gather/scatter could be used without conflict-detection because setting an already-set bit is ok. But no, the problem is when you need to set 2 different bits in the same dword. Great point about sorting M being good for locality in N. A partial sort that leaves it mostly sorted could be very good. – Peter Cordes Aug 26 '17 at 04:40
  • You mean `lock or`, not `fetch or`. Anyway, writing in C you already have to write code that addresses a `uint32_t`. Only `vector<bool>` has an interface that's actually like what `bts [mem]` gives you, and that's implemented in C++ rather than with built-in functions, so the compiler would have to turn the OR-into-a-dword pattern back into a `lock bts` with a peephole optimization. It could if `bts` was fast, but it's not very surprising that it wasn't worth anyone's time to make compilers spot that, even for `-Os`. It was silly of me to say `fetch_or` could compile to `lock bts`. – Peter Cordes Aug 26 '17 at 04:46
  • Gather scatter for the single-threaded (or otherwise properly partitioned to avoid conflicts), right? I didn't focus too much on the bit setting part since I figured you could probably write a kernel for that that gets to 1 store per cycle (perhaps with some SIMD help for the generating the indexes & setting the `(1UL << pos)` bit. I figure the partial sort will be the slower part. – BeeOnRope Aug 26 '17 at 04:48
  • All of gcc, clang and icc seem to like [lock or](https://godbolt.org/g/knoimJ). Maybe because bts with mem arg is just too... crazy? – BeeOnRope Aug 26 '17 at 04:49
  • Yes, I was thinking of gather/scatter for the bit-setting kernel once you've solved the multi-threading part. – Peter Cordes Aug 26 '17 at 04:51
  • @PeterCordes - well it's not too crazy to think it could use `lock bts`. Some compilers certainly already use (reg dest) `bts` when you give it a `uint64_t` pattern with the `| (1UL << pos)` type thing. That seems at the same difficulty level as mem dest `bts`? It certainly would avoid unlocked `bts [mem]` as you point out and they aren't going to go back and put it back in for atomics, where it is at most a tie and possibly terrible (I have no numbers). – BeeOnRope Aug 26 '17 at 04:52
  • I meant using `mov eax, [M + rcx]` / `lock bts [N], eax` for the *whole* problem, actually taking advantage of `bts`'s ability to index outside the dword selected by the addressing mode to save the work of copying + right-shifting the register to get a byte index. Saving that much work is what could make `bts [mem],reg` worth using, since it does exactly that for you. That doesn't come up in the register-dest case, and it makes sense that compilers use it there because `bts r,r` is 1 uop on Intel (2 on Ryzen). Compiler writers know not to use a peephole to generate mem-dest `bts` for that. – Peter Cordes Aug 26 '17 at 04:59
  • Oh yeah, using the full indexing of `bts [mem]` is probably not ever going to happen. It's too bad that `bts [mem], reg` just sucks. The immediate version is close to being OK I guess. – BeeOnRope Aug 26 '17 at 05:13
  • Just tested on Skylake in a loop that updates a different dword every iteration (sequentially over a 4k buffer): `lock bts [rdi], edx` / `add edx, 33` / `and edx, 4096*8 - 1` (and looping with sub/jg): **`lock bts` takes 475M cycles** for 25M iterations (525M uops issued (fused domain), 400M uops_executed (unfused domain)). A similar loop with **`lock or` takes 450M** cycles. issued=425M, executed=275M. (that's with 4 integer uops + the loop condition: `xor eax,eax` / `bts eax, ebp` (loop counter) / `add edx, 4` / `and edx, 4096 - 1` / `lock or [rdi + rdx], eax` / sub+jg.) – Peter Cordes Aug 26 '17 at 05:46
  • Should it be `add edx, 32` not `33` to be comparable and avoid lots of split locks? Interesting results, and about what I'd expect: the `lock` cost is very large and probably dominates most of the rest - usually about 15-20 cycles and that's exactly what you get (475 / 25 = 19). Note the low IPC of the `lock or` approach, while the `bts` approach was able to mostly hide its crapload of uops under the `lock` latency. @PeterCordes – BeeOnRope Aug 26 '17 at 05:51
  • I'm incrementing the bit index by 33 so it occasionally skips a dword (without making things too "easy", although 32 should perform the same). It's up to `lock bts` to do aligned accesses to the containing byte (`[rdi]` is 64B-aligned). It's interesting that `lock bts` throughput is lower than `lock or` throughput even in the ideal case. My `lock or` throughput matches Agner's 18 cycles, but `lock bts` is 19 cycles. So I think 1 uop of that overhead ended up in the critical path. – Peter Cordes Aug 26 '17 at 05:55
  • Yes, it's probably something minor like after waiting for the store buffer to flush (to support the implied fence of `lock`) there is just a one cycle longer latency in the mess of `bts` uops to starting the critical path leading to the next locked op. Perhaps the extra latency just comes from the more complex address calculation - it seems like it could run in parallel but maybe the initial load has a one cycle longer latency (like how some addressing modes have +1 latency). – BeeOnRope Aug 26 '17 at 06:00
  • Changing to `add edx, 65` / and / `bts [rdi], rdx` runs in 500Mc (**20c throughput for `lock bts [rdi], r64`** on SKL). With `add edx, 33` so most pairs of accesses are to the same qword, I'm seeing 560.889Mc, or 22.43c throughput. – Peter Cordes Aug 26 '17 at 06:02
  • So I guess the only time it's worth using `lock bts` is when you actually want the flag result for the old value of the bit. Any other time it's better to use `xor`-zero / `bts reg,reg` / address calc / `lock or`. (Not counting code-size / L1I pressure reasons.) So, only time it's worth using in a loop anyway. – Peter Cordes Aug 26 '17 at 06:07
  • 1
    Or `shlx` can replace the `xor` and `bts` if you have a register with a 1 initialized outside the loop. – BeeOnRope Aug 26 '17 at 06:09
  • It's weird that changing the reg to `r64` somehow made the `lock bts` loop with increment 33 regress from 19 to 22.43 cycles if I'm understanding the above? @PeterCordes – BeeOnRope Aug 26 '17 at 06:11
  • More like from 20 to 22.43, since `lock bts [mem], r64` has 20c throughput with increment=65, but yes it's weird. (It's also weird that larger operand-size is slower at all). I suspect that it ends up doing a `qword` memory access for that instead of a `dword` access (maybe it's easier for microcode to have the memory operand-size match the register operand-size? The manual allows this; it doesn't require `bt*` to do single-byte accesses). I guess the latency of 2 subsequent atomic accesses to the same qword is high enough to be a problem, or it causes a bubble somewhere. – Peter Cordes Aug 26 '17 at 06:15
  • 1
    It could be explained by store forwarding. The next iteration's read hits the store from the previous iteration if reads/writes are now 8-bytes. Although in my mental model there is not actually any store forwarding since the implied fence from the locked op should not allow the later loads to proceed until the SB is empty, but who knows how it all pans out in practice. A bunch of back-to-back atomic ops is not exactly common anyways. – BeeOnRope Aug 26 '17 at 06:20
  • Right, but why can't OOOE hide the store-forwarding or whatever latency? The pairs of dependent `lock bts` instructions are all independent of other pairs. (related: with all accesses going to the same byte (bit alternating between 0 and 1, everything else the same), `lock bts [rdi], r32/r64` are the same, at 25c latency. – Peter Cordes Aug 26 '17 at 06:22
  • Because the memory access _is_ the critical dependency chain. It's not the usual one where a read feeds into the address for the next one, but it is implicitly like that due to the fence. All instructions after the `lock`ed one implicitly depend on it (or at least the next read does). So the subsequent read is part of the chain leading up to the next fence and that's why you get the long latency. – BeeOnRope Aug 26 '17 at 06:25
  • Oh, I see what you're saying. The 20c best-case is because of the barrier, and store-forwarding (or cache-access) latency on top of that loses time that you can never get back. – Peter Cordes Aug 26 '17 at 06:29
  • Hyperthreading would probably have near-perfect efficiency on the baseline `lock or` implementation, since each logical thread has its own store-buffer to flush in parallel. (The OP has a Ryzen, but I assume it's similar.) – Peter Cordes Aug 26 '17 at 06:36
  • Yes. The back to back `lock` guys are latency limited and reads can be part of that chain too (including the implied read in the next locked guy). I'm curious, for example, what the impact of well-spaced `lock` instructions is in a stream of unrelated dependent reg,reg operations. For example, a string of dependent `mul` and a `lock or` every 10 instructions. I'm guessing the cost may be close to zero since 10 mul is slower than 1 `lock or`. – BeeOnRope Aug 26 '17 at 06:36
  • I'm also curious if the same location sensitivity happens with lock or: does it also degrade to 24 or 25c when the same location is used over and over? – BeeOnRope Aug 26 '17 at 06:38
  • 1
    I tried it with `times 10 imul ecx,ecx` and commenting out (or not) the `lock or` block. The difference (if any) is below the measurement noise level, at about 750.4Mc for 25M iters. – Peter Cordes Aug 26 '17 at 06:38
  • Based on what I saw in the question I made specifically asking how awesome sharing data between hyperthreads would be I guess I would say "anything can happen" but yeah in principle they should be fine since they are not hitting the same data like in my question. Still I don't know why the SB flush takes 15 cycles or so - what's happening in that time? The SB should be empty, so perhaps this is some operation that blocks threads at the hardware level? Are the SBs dedicated/statically partitioned in HT? – BeeOnRope Aug 26 '17 at 06:42
  • 1
    Ha! Neat finding regarding minimum lock latency. So we can say that lock can be totally free, depending. In fact when they are used for mutex acquisition this normally doesn't help because the first thing you probably do in a mutex is read from memory (after all, you're protecting memory), so you usually end up paying the full penalty in that case. A fire and forget increment of an atomic counter followed by enough reg,reg work could be one place it could be free. Interesting optimization opportunity... – BeeOnRope Aug 26 '17 at 06:45
  • 2
    Yes, Intel states clearly that HT *statically* partitions the store buffer, so each logical thread has its own. (https://stackoverflow.com/questions/27797424/with-hyper-threading-threads-of-one-physical-core-are-exchanging-via-what-level/27902942#27902942) – Peter Cordes Aug 26 '17 at 06:48
  • It also isn't clear to me if the fence also impedes subsequent stores. In principle it doesn't have to because the SB enforces their order, but the mechanics of flushing the SB may not allow later stores to be written during this period. – BeeOnRope Aug 26 '17 at 06:50
  • 1
    Good suggestions to try `lock or` for the same-address case. It's no slower, still 18c. That makes `lock bts` look even weirder. – Peter Cordes Aug 26 '17 at 06:51
  • I have no explanation there (well the or case is the one that makes more sense to me...). – BeeOnRope Aug 26 '17 at 06:53
  • 1
    re: lock overhead: I tried adding an `add [rsi + rdx], 1` into my `lock or` loop (rsi = rdi + 4k). That slowed it down to 34c per iter. But removing the 4k-aliasing with `[rsi + rdx + 24]` makes it 25c per iter (same as same-address `lock bts` from earlier, in case that's not a coincidence). The non-locked `add` is going to the same address every time. With a pure store rather than RMW, it's also 25c per iter between `lock or [same], reg` – Peter Cordes Aug 26 '17 at 06:58
  • Huh, so in the 4k case store forwarding is attempted, fails and all that adds to the critical chain. The 25c result is interesting. Perhaps additional cost to the implied fence with another store in the SB? – BeeOnRope Aug 26 '17 at 07:04
  • In the 4k-aliasing case, the non-locked `add` is treated as dependent on the `lock or`, *and vice versa*, because the memory disambiguation hardware only checks the bits below the page offset. I guess you might be right that it's the same as a failed store-forwarding. BTW, making sure the addresses are ready far ahead of time doesn't help. (`[rdi]` instead of `[rdi + rdx]`, since the `and edx,1` is still in the loop). OOO was getting rdx ready far ahead of time anyway, since a memory barrier doesn't serialize non-memory instructions. – Peter Cordes Aug 26 '17 at 07:19
  • 1
    4k-aliasing isn't a problem for a pure store (625Mc same as with just `lock or`), only a load (750Mc) or a RMW (850Mc). Hrm, using `[rsi + rdx]` and `[rdi + rdx]` is somehow than non-indexed, even though `rdx` is always 0. – Peter Cordes Aug 26 '17 at 07:21
  • Using `mov edx, [rsi + rdx]` and `lock or [rdi + rdx], eax` is somehow faster (around 708Mc +- 1) than non-indexed (and inconsistent), even though `rdx` is always 0. But non-indexed for the load is faster, like 656Mc. – Peter Cordes Aug 26 '17 at 07:27
  • Yes exactly, memory disambiguation is there to do store forwarding. There's a predictor there, so you'd imagine that this would quickly get flagged as "does not forward" but maybe that's not really what the predictor does (maybe it just predicts which loads can issue before earlier stores without having to be redone since a store ended up hitting them). I guess the fact that 4k aliasing isn't trivially solved by the predictor in general is evidence it doesn't work that way! – BeeOnRope Aug 26 '17 at 07:28
  • 1
    Yup makes sense that store is no issue. You need a read to trigger the wrong forwarding. Stores can kill other stores no problem or different strategy needed. – BeeOnRope Aug 26 '17 at 07:31
2

@IraBaxter posted an interesting but flawed idea which can be made to work (at significant cost). I suspect @BeeOnRope's idea of partial-sort / partitioning the M array will perform better (especially for CPUs with large private caches which can keep parts of N hot). I'll summarize the modified version of Ira's idea that I described in comments on his deleted answer. (That answer has some suggestions about how big N has to be before it's worth multi-threading.)


Each writer thread gets a chunk of M with no sorting/partitioning.

The idea is that conflicts are very rare because N is large compared to the number of stores that can be in flight at once. Since setting a bit is idempotent, we can handle conflicts (where two threads want to set different bits in the same byte) by checking the value in memory after an RMW operation like or [N + rdi], al (with no lock prefix) to make sure it really does have the bit set that we want.

E.g. thread 1 tried to store 0x1 and stepped on thread 2's store of 0x2. Thread 2 must notice and retry the read-modify-write (probably with lock or to keep it simple and make further retries unnecessary) to end up with 0x3 in the conflict byte.

We need an mfence instruction before the read-back. Otherwise store-forwarding will give us the value we just wrote, before other threads see our store. In other words, a thread can observe its own stores earlier than they appear in the global order. x86 does have a Total Order for stores, but not for loads. Thus, we need mfence to prevent StoreLoad reordering. (Intel's "Loads Are not Reordered with Older Stores to the Same Location" guarantee is not as useful as it sounds: store/reload isn't a memory barrier; they're just talking about out-of-order execution preserving program-order semantics.)

mfence is expensive, but the trick that makes this better than just using lock or [N+rdi], al is that we can batch operations: e.g. do 32 or instructions and then 32 read-backs. It's a tradeoff between mfence overhead per operation and an increased chance of false sharing (reading back cache lines that had already been invalidated by another CPU claiming them).

Instead of an actual mfence instruction, we can do the last or of a group as a lock or. This is better for throughput on both AMD and Intel. For example, according to Agner Fog's tables, mfence has one per 33c throughput on Haswell/Skylake, where lock add (same performance as or) has 18c or 19c throughput. Or for Ryzen, ~70c (mfence) vs. ~17c (lock add).

If we keep the number of operations per fence very low, the array index (m[i]/8) + mask (1<<(m[i] & 7)) can be kept in registers for all the operations. This probably isn't worth it; fences are too expensive to do as often as every 6 or operations. Using the bts and bt bit-string instructions would mean we could keep more indices in registers (because no shift-result is needed), but probably not worth it because they're slow.

Using vector registers to hold indices might be a good idea, to avoid having to reload them from memory after the barrier. We want the load addresses to be ready as soon as the read-back load uops can execute (because they're waiting for the last store before the barrier to commit to L1D and become globally visible).

Using single-byte read-modify-write makes actual conflicts as unlikely as possible. Each write of a byte only does a non-atomic RMW on 7 neighbouring bytes. Performance still suffers from false-sharing when two threads modify bytes in the same 64B cache-line, but at least we avoid having to actually redo as many or operations. 32-bit element size would make some things more efficient (like using xor eax,eax / bts eax, reg to generate 1<<(m[i] & 31) with only 2 uops, or 1 for BMI2 shlx eax, r10d, reg (where r10d=1).)

Avoid the bit-string instructions like bts [N], eax: it has worse throughput than doing the indexing and mask calculation for or [N + rax], dl. This is the perfect use-case for it (except that we don't care about the old value of the bit in memory, we just want to set it), but still its CISC baggage is too much.

In C, a function might look something like

/// UGLY HACKS AHEAD, for testing only.

//    #include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
void set_bits( volatile uint8_t * restrict N, const unsigned *restrict M, size_t len)
{
    const int batchsize = 32;

    // FIXME: loop bounds should be len-batchsize or something.
    for (size_t i = 0 ; i < len ; i+=batchsize ) {
        for (int j = 0 ; j<batchsize-1 ; j++ ) {
           unsigned idx = M[i+j];
           unsigned mask = 1U << (idx&7);
           idx >>= 3;
           N[idx] |= mask;
        }

        // do the last operation of the batch with a lock prefix as a memory barrier.
        // seq_cst RMW is probably a full barrier on non-x86 architectures, too.
        unsigned idx = M[i+batchsize-1];
        unsigned mask = 1U << (idx&7);
        idx >>= 3;
        __atomic_fetch_or(&N[idx], mask, __ATOMIC_SEQ_CST);
        // _mm_mfence();

        // TODO: cache `M[]` in vector registers
        for (int j = 0 ; j<batchsize ; j++ ) {
           unsigned idx = M[i+j];
           unsigned mask = 1U << (idx&7);
           idx >>= 3;
           if (! (N[idx] & mask)) {
               __atomic_fetch_or(&N[idx], mask, __ATOMIC_RELAXED);
           }
        }
    }
}

This compiles to approximately what we want with gcc and clang. The asm (Godbolt) could be more efficient in several ways, but it might be interesting to try this. This is not safe: I just hacked this together in C to get the asm I wanted for this stand-alone function, without inlining into a caller or anything. __atomic_fetch_or is not a proper compiler barrier for non-atomic variables the way asm("":::"memory") is. (At least the C11 stdatomic version isn't.) I should probably have used the legacy __sync_fetch_and_or, which is a full barrier for all memory operations.

It uses GNU C atomic builtins to do atomic RMW operations where desired on variables that aren't atomic_uint8_t. Running this function from multiple threads at once would be C11 UB, but we only need it to work on x86. I used volatile to get the asynchronous-modification-allowed part of atomic without forcing N[idx] |= mask; to be atomic. The idea is to make sure that the read-back checks don't optimize away.

I use __atomic_fetch_or as a memory barrier because I know it will be on x86. With seq_cst, it probably will be on other ISAs, too, but this is all a big hack.
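
For completeness, a sketch of how the function above might be driven from multiple threads, each on a disjoint slice of M (illustrative only; it assumes a C++-side declaration of set_bits and ignores the FIXME about len not being a multiple of the batch size):

#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed declaration of the C function above (restrict only matters for the definition).
extern "C" void set_bits(volatile uint8_t* N, const unsigned* M, size_t len);

void set_bits_parallel(volatile uint8_t* N, const unsigned* M, size_t len, unsigned T)
{
    const size_t batch = 32;
    const size_t per_thread = (len / T) / batch * batch;  // keep slices batch-aligned

    std::vector<std::thread> threads;
    size_t start = 0;
    for (unsigned t = 0; t < T; ++t) {
        size_t count = (t == T - 1) ? (len - start) : per_thread;
        threads.emplace_back(set_bits, N, M + start, count);
        start += count;
    }
    for (auto& th : threads) th.join();
}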

Peter Cordes
0

There are a couple of operations involved in sets (A,B = set, X = element in a set):

Set operation                 Instruction
---------------------------------------------
Intersection of A,B           A and B
Union of A,B                  A or B
Symmetric difference of A,B   A xor B
A is subset of B              A and B = A
A is superset of B            A and B = B
A <> B                        A xor B <> 0
A = B                         A xor B = 0
X in A                        BT [A],X
Add X to A                    BTS [A],X
Remove X from A               BTR [A],X

Given that you can use the boolean operators to replace the set operations, you can use VPXOR, VPAND etc.
To set, reset or test individual bits you simply use

mov eax,BitPosition
BT [rcx],rax

You can test whether a set is (equal to) empty (or equal to something else) using the following code

vpxor      ymm0,ymm0,ymm0       //ymm0 = 0
//replace the previous instruction with something else if you don't want
//to compare to zero.
vpcmpeqq   ymm1,ymm0,[mem]      //compare mem qwords to 0 per qword
vpslldq    ymm2,ymm1,8          //line up qw0 and 1 + qw2 + 3
vpand      ymm2,ymm1,ymm2       //combine qw0/1 and qw2/3
vpsrldq    ymm1,ymm2,16         //line up qw0/1 and qw2/3
vpand      ymm1,ymm1,ymm2       //combine qw0123, all in the lower 64 bits.
//if the set is empty, all bits in ymm1 will be 1.
//if it's not, all bits in ymm1 will be 0.

(I'm sure this code can be improved using the blend/gather etc instructions) From here you can just extend to bigger sets or other operations.
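
If you prefer intrinsics over raw assembly, an equivalent emptiness test can be written as a short sketch like this (the function name is illustrative; vptest via _mm256_testz_si256 avoids the per-lane shuffle/AND combining):

#include <immintrin.h>
#include <cstddef>

// Returns true if no bit is set in the nvec __m256i words of the set.
bool set_is_empty(const __m256i* A, size_t nvec)
{
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < nvec; ++i)
        acc = _mm256_or_si256(acc, _mm256_loadu_si256(A + i));
    return _mm256_testz_si256(acc, acc) != 0;   // 1 when acc is all zero
}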

Note that bt, btc, bts with a memory operand are not limited to 64 bits.
The following will work just fine.

mov eax,1023
bts [rcx],rax   //set bit 1023, i.e. the 1024th bit (bit numbering starts at 0).

Johan
  • The problem is rather to set bits to `1` efficiently in parallel (multiple threads), given an array of bit indexes to set to `1` (and leave the other bits unchanged). – Serge Rogatch Aug 10 '17 at 16:22
  • and's and or's are your friend, as detailed above – Johan Aug 11 '17 at 08:46