
Basically I have a struct with the following definition:

#include <cstdint>
#include <immintrin.h>  // _tzcnt_u64

#define BATCH_SIZE 8
#define BATCH_SIZE_LOG 3
//#define BATCH_MASK 0x7070707070707070

// for the sake of understanding the ASM turn this into a no-op
#define BATCH_MASK (~(0UL))
struct batcher {
    uint8_t indexes[8];
    uint64_t vals[8 * BATCH_SIZE];

    uint32_t __attribute__((noinline))
    push(const uint64_t i, const uint64_t v) {
        if(__builtin_expect(indexes[i] < (BATCH_SIZE - 1), 1)) {
            vals[8 * i + indexes[i]++] = v; 
            return 0;  
        }
        return 1;
    }

    uint32_t __attribute__((noinline))
    claim(const uint64_t i) {
        if(__builtin_expect(indexes[i] == (BATCH_SIZE - 1), 1)) {
            indexes[i] = 8;
            return 0;
        }
        return 1;
    }

    uint32_t 
    can_pop() const {
        if(*((uint64_t *)(&indexes[0])) & BATCH_MASK) {
            return 1;
        }
        return 0;
    }

    uint64_t __attribute__((noinline))
    pop() {
        if(__builtin_expect(can_pop(), 1)) {
            const uint32_t idx = _tzcnt_u64(*((uint64_t *)(&indexes[0])) & BATCH_MASK) >> BATCH_SIZE;
            return vals[8 * idx + --indexes[idx]]; 
        }
        return 0;
    }
};

What I am curious about is whether pop can be implemented with only 1 memory load from indexes (so 2 total: 1 from indexes and 1 from vals).

The first memory load is to interpret all of indexes as a uint64_t so that I can check whether it is non-zero, and if so, use one of those indexes.

I have been looking at the assembly output here

which implements pop with:

batcher::pop():
        movq    (%rdi), %rax  // first load from indexes
        testq   %rax, %rax
        jne     .L11
        ret
.L11:
        xorl    %edx, %edx
        movzbl  (%rdi,%rdx), %eax // second load from indexes
        decl    %eax
        movb    %al, (%rdi,%rdx)
        movzbl  %al, %eax
        movq    8(%rdi,%rax,8), %rax
        ret

The way the compiler implements this is one load from (%rdi) into %rax for the interpretation as a uint64_t (testing whether any index is non-zero), and, if the condition passes, a second load of the calculated uint8_t index.

I am wondering if there is a way to implement pop in assembly (which is what I will be doing) without two loads. I know I could accomplish the same logic with shifting / masking on the result from the first load. What I am specifically wondering is whether there is a way for me to index into the uint64_t resulting from the first load as if it were a uint8_t[8] array.

My guess is that this is not possible, because registers don't have a memory address, so it doesn't fully make sense to be able to do that, but I might be missing some instruction made specifically for isolating bytes in a uint64_t, or some way the assembly implementation of pop could be refactored to enable this.

Note: I am limited to the instruction sets available on Intel Skylake.

If anyone has any ideas I would appreciate them. Thank you!

    Unrelated, but `((uint64_t *)(&indexes[0])` is a strict aliasing violation (UB). – rustyx Aug 22 '20 at 19:32
  • Yes. Since I will be the one implementing this in assembly, though, the optimizer won't need to worry :) – Noah Aug 22 '20 at 19:37

1 Answer


Probably tzcnt, round down that count to a multiple of 8 bits, and right shift (with BMI2 shrx so it's a single uop). Then the non-zero byte is in the bottom of the register, where you can movzbl zero-extend it into any other reg (not the same one; that would defeat mov-elimination).

    tzcnt  %rax, %rcx          # input in RAX
    and    $-8, %ecx           # 0xff...f8
    shrx   %rcx, %rax, %rdx    # rdx = rax >> cl
    movzbl %dl,  %eax          # zero latency between separate registers

(If all-zero is possible, test / jz if you need to detect that case, or just let the shift happen: tzcnt of 0 gives 64, and a qword shift count of 64 is masked to 0, leaving the value unchanged, so the result will still be 0.)

You can do this with intrinsics like `_tzcnt_u64`; there is no apparent benefit in using inline asm for this. You can do unaligned, strict-aliasing-safe qword loads in GNU C with `typedef uint64_t aliasing_u64 __attribute__((aligned(1), may_alias));`.
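
A rough sketch of what that could look like with intrinsics (the free-function form and the name pop_single_load are just for illustration; it assumes the batcher definition and macros from the question are in scope, and the decremented count still has to be written back with a store, but indexes is only loaded once):

    #include <cstdint>
    #include <immintrin.h>

    typedef uint64_t aliasing_u64 __attribute__((aligned(1), may_alias));

    uint64_t pop_single_load(batcher *b) {
        // single qword load covering all 8 index bytes
        const uint64_t raw = *(const aliasing_u64 *)(&b->indexes[0]);
        if (__builtin_expect((raw & BATCH_MASK) != 0, 1)) {
            const uint64_t bit = _tzcnt_u64(raw & BATCH_MASK);   // lowest set bit position
            const uint32_t idx = bit >> BATCH_SIZE_LOG;          // byte index = bit / 8
            const uint8_t  cur = (uint8_t)(raw >> (bit & ~7UL)); // isolate that byte in a register
            b->indexes[idx] = cur - 1;        // write back the decremented count (a store, not a load)
            return b->vals[8 * idx + (cur - 1)];
        }
        return 0;
    }

The byte isolation here is the same tzcnt / and / shrx / movzbl idea as the asm above, just expressed in C.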


With only 8 bytes, it would be overkill to do the usual SIMD of pcmpeqb / pmovmskb / tzcnt on the movemask result to find the byte position. (And then an integer movzbl load of that byte from memory, using the byte offset.)
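
For comparison, a hypothetical sketch of that SIMD route with SSE2 + BMI intrinsics (the helper name first_nonzero_byte_simd is made up):

    #include <cstdint>
    #include <immintrin.h>

    // Byte position of the first non-zero byte in an 8-byte array (32 if all are zero).
    static inline uint32_t first_nonzero_byte_simd(const uint8_t *bytes) {
        __m128i v   = _mm_loadl_epi64((const __m128i *)bytes);  // movq: load the 8 bytes
        __m128i eqz = _mm_cmpeq_epi8(v, _mm_setzero_si128());   // pcmpeqb: 0xFF where byte == 0
        uint32_t zero_mask = (uint32_t)_mm_movemask_epi8(eqz);  // pmovmskb: one bit per byte
        uint32_t nonzero   = ~zero_mask & 0xFFu;                // set bits = non-zero bytes
        return _tzcnt_u32(nonzero);                             // tzcnt on the movemask result
    }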

  • Thanks! Out of curiosity: why not `shrx %rcx, %rax, %rax` then `movzbl %al, %eax`? Also the reason I'm doing inline asm for everything is that it's necessary for `rseq` critical sections. Otherwise I generally feel the compiler is smarter than me – Noah Aug 22 '20 at 20:05
    @Noah: because that would defeat mov elimination, costing 1 extra cycle of latency on the critical path. (And a back-end uop, but that's often irrelevant; the front-end is narrower). See [Can x86's MOV really be "free"? Why can't I reproduce this at all?](https://stackoverflow.com/q/44169342). You could `shrx %rcx, %rax, %rcx` / `movzbl %cl, %eax` if you didn't want to keep the position around. – Peter Cordes Aug 22 '20 at 20:08