
I am working on an assembly function that sets a buffer to zero, and I am measuring the clock cycles it takes to execute. However, the measured cycle count stays the same as I increase the buffer size, and I'm unable to explain this behavior.

Here's the assembly function I'm using:

_set0:
set0:
    movq    $0, (%rdi)
    movq    $0, 8(%rdi)
    movq    $0, 16(%rdi)
    movq    $0, 24(%rdi)
    movq    $0, 32(%rdi)
    movq    $0, 40(%rdi)
    ret

I expected that as I increase the number of movq instructions (i.e., the buffer size), the number of clock cycles required to execute the function would increase proportionally. However, when I modify the function as follows:

_set0:
set0:
    movq    $0, (%rdi)
    movq    $0, 8(%rdi)
    movq    $0, 16(%rdi)
    movq    $0, 24(%rdi)
    movq    $0, 32(%rdi)
    movq    $0, 40(%rdi)
    movq    $0, 48(%rdi)
    movq    $0, 56(%rdi)
    movq    $0, 64(%rdi)
    movq    $0, 72(%rdi)
    movq    $0, 80(%rdi)
    movq    $0, 88(%rdi)
    ret

The number of clock cycles measured remains the same, despite the increased buffer size.

I would appreciate any insights into why the measured clock cycles do not increase with the buffer size as expected.

To measure clock cycles, I call the function from a C file and use this helper:

static inline uint64_t cpucycles(void) {
    uint64_t result;

    /* RDTSC returns the timestamp counter in EDX:EAX; combine into one 64-bit value */
    __asm__ volatile("rdtsc; shlq $32,%%rdx; orq %%rdx,%%rax" : "=a"(result) : : "%rdx");

    return result;
}

and then I take the median like this:

static uint64_t cpucycles_median(uint64_t *cycles, size_t timings) {
    /* convert the raw timestamps into per-call deltas, in place */
    for (size_t i = 0; i < timings - 1; i++) {
        cycles[i] = cycles[i + 1] - cycles[i];
    }

    return median(cycles, timings - 1);
}
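
The median helper itself isn't shown; it's essentially a sort-and-take-the-middle routine, roughly along these lines (the exact implementation may differ; this version uses qsort):

#include <stdlib.h>   /* qsort */

static int cmp_uint64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* sort the samples and return the middle element */
static uint64_t median(uint64_t *values, size_t count) {
    qsort(values, count, sizeof(uint64_t), cmp_uint64);
    return values[count / 2];
}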

To compute the number of cycles the function takes, I run it 1000 times and take the median of the per-call cycle counts.
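
The harness looks roughly like this (the TIMINGS constant, the measure_set0 name, and the extern prototype are simplified placeholders; it assumes cpucycles() and cpucycles_median() from above are in the same file):

#define TIMINGS 1000

extern void set0(uint64_t *buf);   /* the assembly routine above */

static uint64_t measure_set0(uint64_t *buf) {
    uint64_t cycles[TIMINGS + 1];

    /* one timestamp before each call, plus one after the last call */
    for (size_t i = 0; i < TIMINGS; i++) {
        cycles[i] = cpucycles();
        set0(buf);
    }
    cycles[TIMINGS] = cpucycles();

    /* TIMINGS per-call deltas -> median cost of one call */
    return cpucycles_median(cycles, TIMINGS + 1);
}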

Comments:

• That few stores are faster than `rdtsc` throughput, and you're not doing anything to stop out-of-order exec of your code wrt. `rdtsc`. See also [How to get the CPU cycle count in x86_64 from C++?](https://stackoverflow.com/q/13772567) for many details about RDTSC. (And also [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) for the point that this code is faster than timing overhead so is hard to measure; you'd normally want to put it in a loop and time the whole loop.) – Peter Cordes Jun 07 '23 at 23:52
• IDK what effects you're looking for, like front-end decode of those bulky instructions if they miss in the uop cache, or just a throughput bottleneck of 1/clock stores on Intel before Ice Lake. To zero more than a couple qwords, you'd want `xorps %xmm0,%xmm0` and `movups` stores, unless you're in kernel code that can only touch integer regs. Very small blocks of code [don't have a single cost number in cycles](https://stackoverflow.com/q/51607391); their cost as part of the surrounding code depends on what it bottlenecks on. – Peter Cordes Jun 07 '23 at 23:55
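
A minimal sketch of the loop-timing approach suggested in the comments (assuming cpucycles() from the question is in the same file; REPS and the buffer size are arbitrary, and this still does nothing to fence rdtsc against out-of-order execution):

#include <stdint.h>
#include <stdio.h>

#define REPS 100000                /* arbitrary; enough to dwarf timing overhead */

extern void set0(uint64_t *buf);   /* the assembly routine from the question */

int main(void) {
    uint64_t buf[12] = {0};        /* large enough for the 12-store version */

    uint64_t start = cpucycles();
    for (int i = 0; i < REPS; i++)
        set0(buf);                 /* time many back-to-back calls, then divide */
    uint64_t end = cpucycles();

    printf("%.2f cycles per call\n", (double)(end - start) / REPS);
    return 0;
}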
