
Changing t=t+j (the outer loop variable) to t=t+9 inside two nested for loops increased runtime by more than 10%. How is this possible?

for (int j = -1048576; j < 1048576; j += 2000)
  for (int i = -1048576; i < 1048576; i++)
    t += 9; // t += j

CPU: Celeron N3350

I looked at the assembly of both binaries. The only difference I noticed:

t=t+9 :
    addl $9, -4(%rbp)

t=t+j :
    movl -8(%rbp), %eax
    addl %eax, -4(%rbp)
    How do you time it? Sounds unreasonable. – Eugene Sh. Feb 11 '22 at 21:54
  • Did you say "Both versions"? – ctrl-alt-delor Feb 11 '22 at 21:59
  • 3
    Questionable results. Which compiler are you using? What flags/optimizations are you using? And what is the output on godbolt.org ? – Eyal Cinamon Feb 11 '22 at 22:00
  • Well, you posted the answer yourself. Adding the constant requires one instruction whereas the other version first references and loads a value to `eax` before adding it. – Stefan Falk Feb 11 '22 at 22:02
  • 2
    Benchmarking debug builds is normally useless; the bottleneck is store-forwarding latency which on current Intel can actually be lower if the reload attempt doesn't happen right away (because something like a reload delays it). [That Celeron](https://ark.intel.com/content/www/us/en/ark/products/95598/intel-celeron-processor-n3350-2m-cache-up-to-2-40-ghz.html) is an [Apollo Lake](https://en.wikichip.org/wiki/intel/cores/apollo_lake), which is actually low-power Goldmont, not Sandybridge-family, so it's interesting that it seems to have a similar effect to SnB. – Peter Cordes Feb 11 '22 at 22:02
  • 1
    I ran these codes as a single program. It writes "Process exited after x seconds" when the process ends (under Windows). One takes 8.5–10 seconds and the other takes 10.2–11.x seconds. I tried each about 10 times. – Alperen Temur Feb 11 '22 at 22:04
  • 2
    @StefanFalk: The one-instruction code is, reportedly, slower than the two-instruction code. – Eric Postpischil Feb 11 '22 at 22:04
  • 3
    @StefanFalk: You'd expect one memory-destination ADD (RMW) instruction to be faster than a load + a memory-destination ADD. There's an interesting CPU-microarchitecture quirk at the heart of why the asm with extra work runs faster than the other. Fortunately the question included a specific CPU model and disassembly so it's answerable (or identifiable as likely duplicate.) That's several steps removed from anything being generally true as far as C source code is concerned, though! – Peter Cordes Feb 11 '22 at 22:04
  • 1
    I was reading about seemingly random changes in speed, related to alignment, that can mask other changes. The one extra instruction could change alignment and affect caches, memory pages, etc. The article was about a profiler that injects random alignments to get better average-performance figures. I can't remember its name, but it was from around 2010 and may be a Free Software project. However, did you check the speed of the two instructions? On x86 they will not be the same. – ctrl-alt-delor Feb 11 '22 at 22:05
  • @EricPostpischil Sorry, my bad. I mixed that up! – Stefan Falk Feb 11 '22 at 22:05
  • 1
    @PeterCordes How did you understand that I'm using debug build? How can I learn more about effect of store-forwarding on performance? – Alperen Temur Feb 11 '22 at 22:10
  • 3
    Your compiler is keeping a local variable on the stack instead of a register inside a tight loop with no function calls. No optimizing compiler worthy of the name would be that bad except when [intentionally gimping itself for consistent debugging](https://stackoverflow.com/questions/53366394/why-does-clang-produce-inefficient-asm-with-o0-for-this-simple-floating-point). And it's referencing that stack space relative to RBP, i.e. using RBP as a frame pointer, which mainstream x86-64 compilers don't do when optimizing, except in functions with VLAs, or with `gcc -fno-omit-frame-pointer`. – Peter Cordes Feb 11 '22 at 22:18
  • 1
    You can learn more about store forwarding by reading my answer on the linked duplicate; see the link at the top of the page (and the links from those answers). But really, you should just compile with optimization enabled, so the compiler can keep things in registers most of the time. – Peter Cordes Feb 11 '22 at 22:20
  • 1
    @ctrl-alt-delor Nice point. I suspect code alignment can be an issue. I think I may test that by putting another instruction in place of the mov that has the same byte length as the mov and can be issued simultaneously with the add. I didn't understand your question, as one program has one addl and the other has an extra mov. – Alperen Temur Feb 11 '22 at 22:21
  • 1
    Yup, that would be a good experiment to rule out front-end and alignment issues. A 3-byte NOP like `.byte 0x66, 0x66, 0x90` would work. Or `.byte 0x0f, 0x1f, 0x00` to use a more classic long nop. – Peter Cordes Feb 11 '22 at 22:23
  • @PeterCordes I added these instructions by inline assembly. The final assemblies' line counts now match the t=t+j case, but performance is the same as t=t+9, so alignment shouldn't be the issue. – Alperen Temur Feb 11 '22 at 22:41
  • 1
    An optimizing compiler should be able to produce the value to add to `t` without looping, by simply multiplying 9 * the iteration count; this could be done at compile time, reducing that entire inner loop to a single add instruction. – Erik Eidt Feb 11 '22 at 22:54
  • I'm trying to create a program to measure instruction latencies (integer add in this case), so performance itself isn't the issue. If I compile this example with -O3, the inner loop is optimized away. I need to find a way to measure it while still using -O3. – Alperen Temur Feb 11 '22 at 23:21
  • @AlperenTemur: I would strongly recommend not using inline assembly for this. Just use `gcc -S foo.c` to get the asm output as `foo.s`, then edit it by hand before assembling+linking with `gcc foo.s`. That makes sure that having to accommodate the inline asm doesn't affect the code-gen at all. – Peter Cordes Feb 11 '22 at 23:27
  • If you want to measure instruction latencies, write microbenchmarks in asm!! Hoping to get a compiler to emit the instructions you want in a useful loop (without huge loop overhead) is just not going to work at all. See [RDTSCP in NASM always returns the same value (timing a single instruction)](https://stackoverflow.com/q/54621381), and look at existing open-source microbenchmark frameworks like nanobench (https://uops.info/) and https://github.com/travisdowns/uarch-bench – Peter Cordes Feb 11 '22 at 23:30
  • There is a weird effect with inefficient code on some microarchitectures where back-to-back reads/writes to the same location produce slower code than spacing them out. – fuz Feb 11 '22 at 23:30
  • Your compiler is simply smart enough to completely replace the inner loop by `t += (1048576- (-1048576))*9`. I'm only surprised that it does not reduce the outer loop too, turning both of them into a single addition. – Sam Varshavchik Feb 12 '22 at 01:43

0 Answers