I'm pretty new to performance measurement. I came across this question and decided to check it myself. Here is what my benchmarks look like:

For stack:

section .text
    global _start

_start:
    mov r12, 0xFFFFFFFF
    push 0xFFFFFF
    mov_loop:
        mov rax, [rsp]
        dec r12
        jnz mov_loop


    mov rax, 60             ; SYS_exit
    xor edi, edi            ; exit status 0 (rdi was never set)
    syscall

For heap:

SYS_brk equ 0x0C

section .text
    global _start

_start:
    mov rax, SYS_brk
    mov rdi, 0
    syscall

    ;allocate 8 bytes
    mov r10, rax
    mov rax, SYS_brk
    mov rdi, r10
    add rdi, 0x08
    syscall

    mov [r10], dword 0xFFFFFF
    mov rcx, 0xFFFFFFFF
    heap_loop:
        mov rax, [r10]
        dec rcx
        jnz heap_loop

    ;release memory
    mov rax, SYS_brk
    mov rdi, r10
    syscall

    mov rax, 60             ; SYS_exit
    xor edi, edi            ; exit status 0 (rdi still held the break address)
    syscall

Running the benchmarks with `perf stat -d -r 10` showed that I actually measured L1 data cache loads in both cases.

 4,295,747,868      L1-dcache-loads           # 2996.483 M/sec                    ( +-  0.00% )
        48,316      L1-dcache-load-misses     #    0.00% of all L1-dcache hits    ( +- 18.42% )

Is there a way to invalidate the cache lines before each iteration starts?

St.Antario
  • Do you mind *flushing* them instead? `clflush[opt]` will do. I believe there already are a few duplicates about this. – Margaret Bloom Feb 16 '18 at 14:07
  • @MargaretBloom I want to perform reading from the memory on each iteration. Is it flushing? – St.Antario Feb 16 '18 at 14:21
  • I believe `clflush` will do. It invalidates a line by writing it back to memory if needed. – Margaret Bloom Feb 16 '18 at 14:24
  • @MargaretBloom: Yes, the OP wants `clflushopt` (or `clflush`). flush = write-back (if dirty) + invalidate. There's no way (in user-space) to invalidate without write-back, and the OP probably didn't mean that anyway. And this benchmark is read-only anyway. – Peter Cordes Feb 16 '18 at 14:42
  • @St.Antario: See [What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?](https://stackoverflow.com/questions/46087730/what-happens-if-you-use-the-32-bit-int-0x80-linux-abi-in-64-bit-code). I think the break is in the low 32 bits in a normal position-dependent Linux executable, so your code works. But IDK why you're dynamically allocating it at all, and if so why with `brk` instead of `mmap(MAP_ANONYMOUS)` so you can easily choose hugepages or not with mmap flags. And if you don't care about that, why not just use a static array or the stack? Oh, you're testing stack vs. heap – Peter Cordes Feb 16 '18 at 14:43
  • You seem to be loading the same address every time in your heap loop. That's obviously only going to test L1D cache / load-port throughput, not any difference in servicing page faults or page-walking (e.g. maybe one uses anonymous hugepages and the other doesn't.) – Peter Cordes Feb 16 '18 at 14:48
  • @PeterCordes Yes, I could use a static array, and that's what I did the first time. But the question was about comparing reads from the `stack` and the `heap`. – St.Antario Feb 16 '18 at 14:53
  • @PeterCordes _I think the break is in the low 32 bits in a normal position-dependent Linux executable, so your code works._ I don't quite understand what you meant by that and how it is tied to the 32-bit ABI. I used the 64-bit ABI as specified. What's wrong? – St.Antario Feb 16 '18 at 14:54
  • nvm, I saw a `0x08` while glancing over your code and thought I saw `int 0x80`. You are using `syscall` (correctly I assume). Clearly I should just go to bed instead of skimming SO while watching Olympic short-track speed skating, sorry for the confusion. – Peter Cordes Feb 16 '18 at 15:26
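
The `clflush` suggestion from the comments can be sketched against the stack benchmark as follows. This is a minimal sketch under stated assumptions, not a tuned benchmark: the reduced iteration count and the `mfence` after the flush are my own additions (the fence is there so the load is not executed ahead of the flush), and the label `flush_loop` is just an illustrative name. The heap variant would look the same with `[r10]` in place of `[rsp]`.

```nasm
section .text
    global _start

_start:
    mov r12, 0xFFFFFF       ; assumption: fewer iterations, since every load now misses
    push 0xFFFFFF           ; value to read back from the stack
    flush_loop:
        clflush [rsp]       ; write back (if dirty) and invalidate the line holding [rsp]
        mfence              ; assumption: fence so the load below is not run before the flush
        mov rax, [rsp]      ; this load should now miss in L1D and go out to memory
        dec r12
        jnz flush_loop

    mov rax, 60             ; SYS_exit
    xor edi, edi            ; exit status 0
    syscall
```

With this loop I would expect `L1-dcache-load-misses` under `perf stat -d` to rise to roughly the iteration count instead of staying near zero, although I have not measured it. Note that this mostly times a flush followed by a memory access to a single line, so, as pointed out in the comments, it still says little about stack vs. heap as such.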
