Basically I have a struct with the definition
#define BATCH_SIZE 8
#define BATCH_SIZE_LOG 3
//#define BATCH_MASK 0x7070707070707070
// for the sake of understanding the ASM turn this into a no-op
#define BATCH_MASK (~(0UL))
struct batcher {
uint8_t indexes[8];
uint64_t vals[8 * BATCH_SIZE];
uint32_t __attribute__((noinline))
push(const uint64_t i, const uint64_t v) {
if(__builtin_expect(indexes[i] < (BATCH_SIZE - 1), 1)) {
vals[8 * i + indexes[i]++] = v;
return 0;
}
return 1;
}
uint32_t __attribute__((noinline))
claim(const uint64_t i) {
if(__builtin_expect(indexes[i] == (BATCH_SIZE - 1), 1)) {
indexes[i] = 8;
return 0;
}
return 1;
}
uint32_t
can_pop() const {
if(*((uint64_t *)(&indexes[0])) & BATCH_MASK) {
return 1;
}
return 0;
}
uint64_t __attribute__((noinline))
pop() {
if(__builtin_expect(can_pop(), 1)) {
const uint32_t idx = _tzcnt_u64(*((uint64_t *)(&indexes[0])) & BATCH_MASK) >> BATCH_SIZE;
return vals[8 * idx + --indexes[idx]];
}
return 0;
}
};
What I am curious about is whether pop
can be implemented with only 1 memory load from indexes
(so 2 total, 1 from indexes
and 1 from vals
)
The first memory load is to interpret all of indexes
as a uint64_t
so that I can check if it is non 0, and if so use one of those indexes.
I have been looking at the assembly output here
which has implements pop
with
batcher::pop():
movq (%rdi), %rax // first load from indexes
testq %rax, %rax
jne .L11
ret
.L11:
xorl %edx, %edx
movzbl (%rdi,%rdx), %eax // second load from indexes
decl %eax
movb %al, (%rdi,%rdx)
movzbl %al, %eax
movq 8(%rdi,%rax,8), %rax
ret
The way the compiler implements this is one from from %(rdi)
to %rax
for the interpretation as a uint64_t
(testing if there are any non 0 indexes) and a second load if the condition passes loading the calculated uint8_t
index.
I am wondering if there is a way to implement pop
in assembly (what I will be doing) without two loads. I know I could accomplish the same logic with shifting / masking on the result from the first load. What I am specifically wondering is if there is a way for me to index into the uint64_t
resulting from the first load as if it where is uint8_t[8]
array.
My guess is that this is not possibly because register don't have a memory address so it doesn't fully make sense to be able to do that but I might be missing some instruction made specifically for isolating bytes in a uint64_t
or some way that the assembly implementation of pop
could be refactored to enable this.
Note: I am limited to the instruction sets available on Intel Skylake.
If anyone has any ideas I would appreciate them. Thank you!