
I am curious how I can best optimize the assembly below, specifically the part in the code block under "JUMP HERE TO SEE THE ASSEMBLY" (for easy Ctrl-F searching).


I am writing some code and the HOT HOT HOT path is basically finding a 0 bit in a bit vector, setting it, and returning its index.

The bit vector is composed of:

struct l2_bitvec {
    // outer vector with bits indicating which inner vectors have available slots
    uint64_t v1;

    // inner vectors with the actual index bits
    uint64_t v2[64];
};

Each CPU has a bitvec (or multiple chained together in a much slower-path data structure).
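
For context, the per-CPU arrangement is roughly the following (a sketch only; the names are illustrative, not my real code):

// Illustrative sketch: one l2_bitvec per CPU (ignoring the chained slow-path
// case); the vector a thread touches is chosen by __rseq_abi.cpu_id_start
// at call time.
#define NR_CPUS_MAX 256 // hypothetical upper bound
static struct l2_bitvec per_cpu_vecs[NR_CPUS_MAX] __attribute__((aligned(64)));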

To manage coherency within these bit vectors I am using restartable sequences (scroll down a bit for the best manpage I've been able to find).

Due to using rseq (and this being super hot code) the logic is all written in inline assembly.

The C code for what I am trying to write is as follows:

#define LIKELY(X)   __builtin_expect(!!(X), 1)
#define UNLIKELY(X) __builtin_expect(!!(X), 0)
uint64_t __attribute__((noinline))
restarting_l2_set_idx(uint64_t * v1, const uint32_t start_cpu) {
    
// if we are ever preempted, migrated, or catch a signal, we restart here
catch_something_label:
    
    if (start_cpu != __rseq_abi.cpu_id_start) {
        return 4097;
    }

    uint64_t temp_v1 = *v1;
    while (LIKELY(temp_v1 != (~(0UL)))) {
        const uint32_t idx_v1  = _tzcnt_u64((~temp_v1));
        
        uint64_t       temp_v2 = v1[idx_v1 + 1];
        if (LIKELY(temp_v2 != (~(0UL)))) {
            const uint32_t idx = _tzcnt_u64(~temp_v2);
            
            temp_v2 |= ((1UL) << idx);
            v1[idx_v1 + 1] = temp_v2;
            
            return 64 * idx_v1 + idx;
        }
        else {
            temp_v1 |= ((1UL) << idx_v1);
            *v1 = temp_v1;
        }
    }
    
    return -1;
}
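
For clarity, the success return value encodes both levels: 64 * idx_v1 + idx is bit idx of inner word v2[idx_v1]. A hypothetical decode helper (illustrative only, not part of my code):

// e.g. a return value of 130 means bit 2 of v2[2]
static inline void l2_decode_idx(uint64_t ret, uint64_t *idx_v1, uint64_t *idx) {
    *idx_v1 = ret / 64; // which inner vector
    *idx    = ret % 64; // bit within that inner vector
}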

There is some rseq setup stuff which is basically:

#define RSEQ_INFO_DEF(alignment)                                               \
    ".pushsection __rseq_cs, \"aw\"\n\t"                                       \
    ".balign " #alignment                                                      \
    "\n\t"                                                                     \
    "3:\n\t"                                                                   \
    ".long 0x0\n"                                                              \
    ".long 0x0\n"                                                              \
    ".quad 1f\n"                                                               \
    ".quad 2f - 1f\n"                                                          \
    ".quad 4f\n"                                                               \
    ".popsection\n\t"

/*
    ".pushsection __rseq_cs, \"aw\"\n\t"    // creation section
    ".balign " #alignment"\n\t"             // alignment at least 32
    "3:\n\t"                                // struct info jump label
                                            // struct is rseq_info
    ".long 0x0\n"                           // version = 0
    ".long 0x0\n"                           // flags = 0
    ".quad 1f\n"                            // start_ip = 1f (label 1, forward)
    ".quad 2f - 1f\n"                       // post_commit_offset = (end_cs
                                               label - start_cs label)
    ".quad 4f\n"                            // abort label = 4f (label 4)
    ".popsection\n\t"                       // end section
*/


#define RSEQ_CS_ARR_DEF()                                                      \
    ".pushsection __rseq_cs_ptr_array, \"aw\"\n\t"                             \
    ".quad 3b\n\t"                                                             \
    ".popsection\n\t"

/*
    ".pushsection __rseq_cs_ptr_array, \"aw\"\n\t"  // create ptr section
    ".quad 3b\n\t"                                  // set ptr to addr of
                                                       rseq_info
    ".popsection\n\t"                               // end section
*/

#define RSEQ_PREP_CS_DEF(TEMP_REGISTER)                               \
    "leaq 3b (%%rip), " V_TO_STR(TEMP_REGISTER) "\n\t"                         \
    "movq " V_TO_STR(TEMP_REGISTER) ", %%fs:__rseq_abi@tpoff+8\n\t"          \



/*
    "leaq 3b (%%rip), REGISTER\n\t"     // get set for rseq_info struct
    "movq REGISTER, 8(%[rseq_abi])\n\t" // store in ptr field in __rseq_abi
*/

#define RSEQ_CMP_CUR_VS_START_CPUS()                                           \
    "cmpl %[start_cpu], %%fs:__rseq_abi@tpoff+4\n\t"

/*
    "cmpl %[start_cpu], 4(%[rseq_abi])\n\t" // get cpu in 4(%[rseq_abi]) and
                                               compare to %[start_cpu] which is
                                               passed as param to function
*/


// sometimes this is better to put in the
// same code section as the critical section
#define RSEQ_START_ABORT_DEF()                                                 \
    ".pushsection __rseq_failure, \"ax\"\n\t"                                  \
    ".byte 0x0f, 0xb9, 0x3d\n\t"                                               \
    ".long 0x53053053\n\t"                                                     \
    "4:\n\t"                                                                   \

/*
  ".pushsection __rseq_failure, \"ax\"\n\t" // create failure section
    ".byte 0x0f, 0xb9, 0x3d\n\t"            // Disassembler-friendly signature:
                                               ud1 <sig>(%rip),%edi
    ".long 0x53053053\n\t"                  // invalid operation to avoid code
                                               injection 
    "4:\n\t"                                // abort label
*/

#define RSEQ_END_ABORT_DEF() ".popsection\n\t"

/*
    ".popsection\n\t"   // end failure section
*/

In pseudo code what the assembly with all the rseq stuff looks like is:

/*
The assembly will look roughly as follows:
foo(..., uint32_t start_cpu) 
    RSEQ_INFO_DEF(32) 
    RSEQ_CS_ARR_DEF() 
    RSEQ_PREP_CS_DEF()

    // maybe some setup stuff (or maybe abort)

    "1:\n\t"    

    RSEQ_CMP_CUR_VS_START_CPUS()
    // handle migrated somehow

    <actual critical section here>
    "2:\n\t" (this is end label of critical section)

    // if abort is in another code section
    RSEQ_START_ABORT_DEF()
    <logic for abort here>
        // if this is goto generally jmp %l[abort]
        // otherwise some actual logic (usually set return var)
    RSEQ_END_ABORT_DEF()
    : <output variables, only if NOT goto asm>
    : <input variables> +
     [ start_cpu ] "g"(start_cpu), // always
    : <clobber registers> +
      "memory", "cc" // minimum clobbers
    #ifdef IS_GOTO_ASM
    : <jump labels OUTSIDE of the asm>
    #endif
*/

The assembly is optimized around the fact that the VAST majority of aborts are due to preemption and not migration, so the abort path generally just jumps back to checking the current CPU and continues (because the comparison succeeds).

The assembly code I am using is as follows:

JUMP HERE TO SEE THE ASSEMBLY

#define PRIMITIVE_V_TO_STR(X) #X
#define V_TO_STR(X) PRIMITIVE_V_TO_STR(X)

#define _FAILURE_MIGRATED 4097

// inlining the function often breaks stuff, so while testing I am skipping that
// aligning to cache line seems to actually affect performance significantly

uint64_t __attribute__((noinline))
__attribute__((aligned(64)))
    restarting_2l_set_idx(uint64_t * const v1, const uint32_t start_cpu) {
    // return [0 - 4095] -> success (that is the index)
    // return [4097] -> failure the thread migrated
    // return [-1] -> failure the bit vector is full
    
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wuninitialized"
    // pin for return so the compiler doesn't mess it up
    register uint64_t idx asm("rax");

    // some temps I trust the compiler to allocate smartly
    uint64_t * v2;
    uint64_t idx_v1, temp_v1, temp_v2;

    // clang-format off
    asm volatile(
        RSEQ_INFO_DEF(32)
        RSEQ_CS_ARR_DEF()

        // any register will do
        RSEQ_PREP_CS_DEF(%[temp_v1])

        "mov $" V_TO_STR(_FAILURE_MIGRATED) ", %[idx]\n\t"

#ifdef FAST_ABORT
        // skip abort first time
        "jmp 1f\n\t"
        
        ".byte 0x0f, 0xb9, 0x3d\n\t"            // Disassembler-friendly signature: ud1 <sig>(%rip),%edi
        ".long 0x53053053\n\t"                  // invalid operation to avoid code injection 
        "4:\n\t"                                // abort label

        ".byte 0x0f, 0xb9, 0x3d\n\t"
        ".long 0x53053053\n\t"
        "4:\n\t"
        "mov $" V_TO_STR(_FAILURE_MIGRATED) ", %[idx]\n\t"
#endif
        
        // start critical section
        "1:\n\t"
        
        // check if migrated        
        RSEQ_CMP_CUR_VS_START_CPUS()
        // if migrated goto 2:
        "jnz 2f\n\t"

        // if not migrated temp_v1 = *v1
        "movq (%[v1]), %[temp_v1]\n\t"

        // start loop: while(temp_v1 != -1)
        "5:\n\t"
                
        // idx = temp_v1 (inverted with notq below, after the full check)
        "movq %[temp_v1], %[idx]\n\t"

                
        // The reason we can't do this cmp after notq %[idx]
        // (and use testq) is that we jump out with idx as the
        // return value: 0 is a valid idx to return whereas -1 is not
        // (which is also why idx is set before the comparison)

        // if (%[v1]) is full leave. 
        // This branch is VERY unexpected.
        "cmpq $-1, %[idx]\n\t"
        "jz 2f\n\t"
        
        "notq %[idx]\n\t"
        
        // idx_v1 = tzcnt(idx) (find first one)
        "tzcntq %[idx], %[idx_v1]\n\t"

        // if registers are tight v2 could be in
        // memory and could use [idx] as a temporary
        // temp_v2 = v[idx_v1 + 1]
        "leaq 8(%[v1],%[idx_v1],8), %[v2]\n\t"
        "movq (%[v2]), %[temp_v2]\n\t"

        // test if temp_v2 is full
        "cmpq $-1, %[temp_v2]\n\t"
        "jz 7f\n\t" // 7f is btsq %[idx_outer], %[temp_v1], jmp 5b
        
        // idx = ~temp_v2
        "movq %[temp_v2], %[idx]\n\t"
        "notq %[idx]\n\t"
        // could replace the cmpq $-1, %[temp_v2], jz above with
        // testq %[idx], %[idx], jz here

        // idx = tzcnt(idx)
        "tzcntq %[idx], %[idx]\n\t"

        // temp_v2 |= 1 << idx
        "btsq %[idx], %[temp_v2]\n\t"
        "jmp 9f\n\t"

        "7:\n\t"
        "btsq %[idx_v1], %[temp_v1]\n\t"
        
        // this is a completely valid state to be migrated out after
        // (all we have really done is clean up the v1 vector a bit).
        // Because we can be migrated out here, we don't check/set whether
        // temp_v2 is full, as that could lead to invalid state in v1
        "movq %[temp_v1], (%[v1])\n\t"

        // this is } in while loop starting at 5:
        "jmp 5b\n\t"

        // prepare for commit and commit
        "9:\n\t"
        
        // prepare success return
        "salq $6, %[idx_v1]\n\t"
        "addq %[idx_v1], %[idx]\n\t"
        
        // commit
        "movq %[temp_v2], (%[v2])\n\t"

        // end critical section
        "2:\n\t"

#ifndef FAST_ABORT
          RSEQ_START_ABORT_DEF()
        // given that the critical section is fairly involved
        // it may be worth it to put this in the same code section
        // as critical section for faster aborts
        "mov $" V_TO_STR(_FAILURE_MIGRATED) ", %[idx]\n\t"
        "jmp 1b\n\t"
        RSEQ_END_ABORT_DEF()
#endif

        : [ idx] "+r" (idx)
        : [ idx_v1 ] "r" (idx_v1),
          [ temp_v2 ] "r" (temp_v2),
          [ temp_v1 ] "r" (temp_v1),
          [ v2 ] "r" (v2), 
          [ v1 ] "g" (v1),
          [ start_cpu] "g" (start_cpu)
        : "memory", "cc");

    return idx;
}
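
For reference, a caller looks roughly like this (a sketch only: per_cpu_vecs and slow_path_alloc are hypothetical names, and rseq must already be registered for the thread):

// Sketch of a caller: retry on migration (4097), fall back when full (-1).
static uint64_t alloc_slot(void) {
    for (;;) {
        const uint32_t cpu = __rseq_abi.cpu_id_start;
        const uint64_t ret = restarting_2l_set_idx(&per_cpu_vecs[cpu].v1, cpu);
        if (LIKELY(ret <= 4095)) {
            return ret;               // success: index within this CPU's vector
        }
        if (ret == (uint64_t)-1) {
            return slow_path_alloc(); // hypothetical slower-path fallback
        }
        // ret == 4097 (_FAILURE_MIGRATED): retry on whatever CPU we are on now
    }
}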

After correctness, my first, second, and third goal is for this to be fast. All optimizations have to account for the fact that the code can jump to abort after any instruction (hence why the commit of temp_v2 to (%[v2]) is the final instruction of the critical section). And if the abort was due to thread migration, the function can't write to any of the data (otherwise there will be serious race conditions).

If you want to run / compile this in userspace you will need to include the linux/rseq.h header. A nice "hello world" for the setup is here and/or at librseq.
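
In case glibc doesn't register rseq for you, a minimal per-thread registration looks roughly like this (a sketch, assuming you link against librseq, which provides the __rseq_abi TLS symbol the asm accesses via %fs:__rseq_abi@tpoff, and that RSEQ_SIG matches the 0x53053053 signature used in the abort blocks above):

#define _GNU_SOURCE
#include <linux/rseq.h>   // struct rseq
#include <sys/syscall.h>  // __NR_rseq
#include <unistd.h>       // syscall()

#define RSEQ_SIG 0x53053053

// provided by librseq
extern __thread volatile struct rseq __rseq_abi;

static int rseq_register_current_thread(void) {
    // rseq(2): register this thread's struct rseq area with the kernel
    return syscall(__NR_rseq, &__rseq_abi, sizeof(__rseq_abi), 0, RSEQ_SIG);
}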

Note: I'm posting this here instead of on codereview.SE because my primary question is about how I can make the assembly in the critical section of restarting_2l_set_idx faster.

Edit: @PeterCordes suggested replacing the leaq here:

        "leaq 8(%[v1],%[idx_v1],8), %[v2]\n\t"
        "movq (%[v2]), %[temp_v2]\n\t"

I changed it to this

        "movq %[v1], %[v2]\n\t"         // v2 = v1
        "salq $3, %[idx_v1]\n\t"        // idx_v1 = 8 * idx_v1
        "addq %[idx_v1], %[v2]\n\t"     // v2 += idx_v1 (index by uint64_t)
        "movq 8(%[v2]), %[temp_v2]\n\t" // temp_v2 = *(v + 8)

As well, since idx_v1 now holds 8x the bit position it represents, the following code also changes:

        // in 7: label
        "btsq %[idx_v1], %[temp_v1]\n\t"

to

        "sarq $3, %[idx_v1]\n\t"
        "btsq %[idx_v1], %[temp_v1]\n\t"

and

        // in 9: label
        "salq $6, %[idx_v1]\n\t"

to

        "salq $3, %[idx_v1]\n\t"

I am not sure if this actually is a performance improvement however. I think it may be stifled by the fact that I do need to store v2 for the commit.

Edit2: @PeterCordes pointed out my edit was dumb: I can drop the v2 temporary and use movq 8(%[v1],%[idx_v1],8), %[temp_v2] to get temp_v2 and movq %[temp_v2], 8(%[v1],%[idx_v1],8) to store it. Sorry for my naive first edit :(
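
With that, the relevant lines of the critical section simply become (same indexed addressing mode for both the load and the final commit; no separate v2 register needed):

        // temp_v2 = v1[idx_v1 + 1]
        "movq 8(%[v1],%[idx_v1],8), %[temp_v2]\n\t"

        // ... find and set the free bit as before ...

        // commit: v1[idx_v1 + 1] = temp_v2
        "movq %[temp_v2], 8(%[v1],%[idx_v1],8)\n\t"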

    *inlining the function often breaks stuff* - because your constraints are broken. You modify the `%[temp_v1]` register, but you told the compiler it was an `"r"` input-only operand. It looks like you just want the compiler to allocate a scratch register for you; do that with an `"=r"` dummy *output* operand! Maybe you should have posted to codereview.SE after all. (Still looking at the overall question, it's very long; this was just the first thing I noticed.) – Peter Cordes Aug 16 '20 at 06:42
  • What's `.byte 0x0f, 0xb9, 0x3d` / `0x53053053`? Always comment any manual machine code so people don't have to look it up. – Peter Cordes Aug 16 '20 at 06:44
  • There's a comment on it in the rseq setup section. But basically they are invalid instructions that the ```rseq``` developers require before the abort label on ```x86-64``` as a canary. Basically they are magic numbers. – Noah Aug 16 '20 at 06:46
  • There's no comment on it in the `#ifdef FAST_ABORT` part I was reading in the code block you told readers to JUMP TO. If you have a macro, why not use it there? – Peter Cordes Aug 16 '20 at 06:49
  • Usually avoid `1UL << _tzcnt_u64(x)`. Instead, isolate the lowest set bit with [BMI1 `blsi`](https://www.felixcloutier.com/x86/blsi), aka `x & -x`, which Haswell/Skylake run as a single uop with 1c latency. Also, this is Linux kernel code so it's not worth considering SSE* / AVX, right? – Peter Cordes Aug 16 '20 at 06:51
  • In the end I need the actual index however. Is there a faster way to get bit position if you know it is isolated? Also added comments in the ```FAST_ABORT``` def. – Noah Aug 16 '20 at 06:53
  • But I could use [BLCS](https://en.wikipedia.org/wiki/Bit_manipulation_instruction_set#TBM_(Trailing_Bit_Manipulation)) with ```temp_v2``` then ```tzcnt``` and skip the ```notq``` and ```btsq```. Edit: Never mind, ```BLCS``` is not supported by Intel – Noah Aug 16 '20 at 06:56
  • Oh, if you need the index anyway, then it might be just as good to use `bt*` instructions with the index, instead of isolating or clearing the lowest set bit in parallel with tzcnt. IDK if there's a bithack emulation for BLCS that's worth using; it would be at least 2 instructions, maybe more. BMI has clear or isolate, but not set a new bit. – Peter Cordes Aug 16 '20 at 07:10
  • `lea` probably isn't worth it; I think you're only doing one pure `mov` load and optionally one pure `mov` store, so `lea` just costs you 1 front-end uop. Micro-fusion of indexed addressing modes works for the store on Haswell/Skylake. Unless you need the store-address uop to be able to run on port 7 to avoid some other bottleneck on ports 2 / 3? Unlikely. – Peter Cordes Aug 16 '20 at 07:16
  • Before I definitely answer (about the ports) I will need to do some reading. But I will look into it. Thank you! After some quick reading I think you're almost definitely right, as unless there is a bug in rseq no other thread should be using ports 2 / 3 on the core (excluding hyperthreading). – Noah Aug 16 '20 at 07:22
  • When you removed `lea`, you're now using 2 *more* instructions??? That's the opposite of what I meant: use addressing modes, not ALU instructions. Your code never stores the value `%[v2]` anywhere, you only use that as a store destination. Just use the `8(%[v1],%[idx_v1],8)` addressing mode in both load and store, like `movq 8(%[v1],%[idx_v1],8), %[temp_v2]`, and similar for the later store. An indexed load is still a single uop, with the same latency as `mov (%reg), %reg` (the 4c load-use latency fast path only applies when the address came directly from another load.) – Peter Cordes Aug 16 '20 at 09:58
  • Thank you @PeterCordes. I've spent a fair amount of time optimizing the asm in my other functions using your helpful comments from here and my other posts (and fixing my use of variable modifiers). My code is now about 3 times faster with the critical section purring! (and I've learned a ton!) – Noah Aug 18 '20 at 04:04
  • Cool, I didn't realize there was that much to gain here. The inline asm itself didn't look that bad. Or I guess you're talking about optimizing stuff beyond what's shown in this specific question. – Peter Cordes Aug 18 '20 at 04:20
  • (fixing my register modifiers and inlining was fairly significant actually) but most of the improvement was on other functions. This function was kind of my sandbox for learning how to efficiently code in asm (and was somewhat putting together my other posts). The improvement came from updating all my other functions with the information I learned. :) – Noah Aug 18 '20 at 14:41
  • Somewhat unrelated: I'm a systems guy with about 99% of my experience in low level C/C++, so I have an idea of what assembly should look like but don't really know enough computer architecture to reason about it beyond what's explicitly stated in the manpages. Are there any light-reading textbooks you would recommend that focus on reasoning about optimizing instructions? Or would you suggest just starting with the basics and working from there? – Noah Aug 18 '20 at 14:46
  • Definitely Agner Fog's asm optimization guide, and especially the Sandybridge and Skylake chapters in his microarch guide. https://agner.org/optimize/. Intel's optimization manual has some good stuff, but it's mixed with stuff from Pentium 4 era, so it's hard to be sure what's still relevant. And it's very long. Also useful, https://www.realworldtech.com/sandy-bridge/ for a look at CPU architecture from a POV less focused on optimizing software for it, more focused on how it runs existing code. It has diagrams where Agner Fog's guide doesn't, so it's a great starting point actually. – Peter Cordes Aug 18 '20 at 17:08
  • Finished https://agner.org/optimize/microarchitecture.pdf and realworldtech from sandy-bridge / haswell :) – Noah Sep 24 '20 at 04:02
  • @PeterCordes Also thought you might be interested but instead of using ```blsi``` I did find that ```tzcntq``` + ```blsrq``` is much faster than ```tzcntq``` + ```btrq``` – Noah Sep 24 '20 at 04:05
  • @PeterCordes do you know of any resource that meaningfully enumerates the unanswered questions in reverse engineering Intel's x86_64 implementation? I feel like at this point I have a reasonable grasp of everything on the x86 wiki / Agner's guides but feel it is likely there are many things I'm missing just because I'm not thinking to ask about them. – Noah May 13 '21 at 19:36
  • No, I don't :/ Good question, I've sometimes come across something I hadn't realized was a problem that needed to get solved, and realized that my previous picture was incomplete. OTOH, from the perspective of optimizing software, some things Just Work and it doesn't really matter how Intel solved some internal detail when there aren't observable effects, or even if they're just really hard to observe. (e.g. some of Lewis Kelsey's answers get so deep in details gleaned / guessed from patents that I find it hard to tell which parts would make any meaningful difference. :/) – Peter Cordes May 13 '21 at 19:44
  • @PeterCordes any links to some of Lewis' posts? Don't recall the name. If I had to weigh it from a perf perspective, what frustrates me the most is lack of understanding of how uops are scheduled, but that already has [a pretty in-depth Q/A](https://stackoverflow.com/questions/40681331/how-are-x86-uops-scheduled-exactly/40754426?noredirect=1#comment117816168_40754426); more details on the "gotchas" of the Uop Cache, and more information on the BPU/BTB, because most of the information out there is pretty outdated. The branch stuff just needs to be retested AFAICT but the other two seem totally unknown – Noah May 13 '21 at 20:39
  • @PeterCordes what about you? What would your list of "wish we knew more about X" be? – Noah May 13 '21 at 20:40
  • https://stackoverflow.com/search?q=user%3A7194773+%5Bcpu-architecture%5D+is%3Aanswer, e.g. [Globally Invisible load instructions](https://stackoverflow.com/a/65949599) looks like an example of what I was talking about, such low level detail that it's hard to connect it back to any performance effects, and lots of abbreviations that aren't defined, and no TL:DR or high-level picture. (Lewis, if you're reading this, I'm sure there's some valuable info in your answers, but without any summary / TL:DR of key points there's not a lot of enticement to start reading so much low-level detail.) – Peter Cordes May 13 '21 at 22:05
  • My "wish we knew more" list includes how exactly the MS-ROM takes over the front end, and when later instructions can get their uops into the back-end after a `rep stos`. I think they just [take over the issue end of the IDQ](https://stackoverflow.com/questions/56218344/how-are-microcodes-executed-during-an-instruction-cycle) while MITE / DSB can potentially be refilling the IDQ from the other end. But there's some size threshold above which `rep movs` stops the CPU from finding ILP between two long imul dep chains, presumably if the initial burst of conditional uops wasn't enough. – Peter Cordes May 13 '21 at 22:19
  • Also, is the L2TLB a victim cache? I don't know if I read it somewhere or just made it up. (Mostly matters for code+data in the same page, whether code-fetch primes L2TLB for data.) Also, does an AH merging uop really have to issue in a cycle by itself? I think so, but I don't think I conclusively tested that, or didn't write down the result. Similarly, does un-lamination ever reduce front-end throughput, e.g. if the last uop of an issue group needs to be un-laminated or some other condition? – Peter Cordes May 13 '21 at 22:23
  • Another item on my list: what exactly stalls during a store-forwarding stall? Does the latency depend on what older stores are in flight? Can fast-path store-forwarding happen during the "stall"? I had previously assumed it was just extra latency, but https://easyperf.net/blog/2018/03/09/Store-forwarding#one-more-interesting-experiment seems to show that two stalled store-forwardings can't overlap their latency. – Peter Cordes May 13 '21 at 23:19
  • @PeterCordes shot you an email (not sure if you check it a lot so pinging you here as well). – Noah Jun 20 '21 at 17:50
  • Indeed I often ignore my email for a long time >.<. In this case, the mail also ended up in my spam folder - found it. – Peter Cordes Jun 20 '21 at 18:04
