
Let's consider this very simple code:

int main(void)
{
    char buff[500];
    int i;
    for (i=0; i<500; i++)
    {
        (buff[i])++;
    }   
}

So, it just goes through 500 bytes and increments each of them. This code was compiled with gcc for the x86-64 architecture and disassembled with the objdump -D utility. Looking at the disassembled code, I found that data is transferred between memory and registers byte by byte (a movzbl instruction loads each byte from memory, and a mov %dl stores it back):

00000000004004ed <main>:
  4004ed:       55                      push   %rbp
  4004ee:       48 89 e5                mov    %rsp,%rbp
  4004f1:       48 81 ec 88 01 00 00    sub    $0x188,%rsp
  4004f8:       c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
  4004ff:       eb 20                   jmp    400521 <main+0x34>
  400501:       8b 45 fc                mov    -0x4(%rbp),%eax
  400504:       48 98                   cltq   
  400506:       0f b6 84 05 00 fe ff    movzbl -0x200(%rbp,%rax,1),%eax
  40050d:       ff 
  40050e:       8d 50 01                lea    0x1(%rax),%edx
  400511:       8b 45 fc                mov    -0x4(%rbp),%eax
  400514:       48 98                   cltq   
  400516:       88 94 05 00 fe ff ff    mov    %dl,-0x200(%rbp,%rax,1)
  40051d:       83 45 fc 01             addl   $0x1,-0x4(%rbp)
  400521:       81 7d fc f3 01 00 00    cmpl   $0x1f3,-0x4(%rbp)
  400528:       7e d7                   jle    400501 <main+0x14>
  40052a:       c9                      leaveq 
  40052b:       c3                      retq   
  40052c:       0f 1f 40 00             nopl   0x0(%rax)

It looks like this has performance implications, because in that case you have to access memory 500 times to read and 500 times to store. I know the cache system will cope with it somehow, but still. My question is: why can't we load a whole quadword, do a couple of bit operations to increment each byte of that quadword, and then push it back to memory? Obviously this would require some additional logic to deal with the last part of the data that is smaller than a quadword, and an extra register, but it would dramatically reduce the number of memory accesses, which are the most expensive operations. Probably I don't see some obstacle that prevents such an optimization, so it would be great to get some explanation here.
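
To make the idea concrete, here is a rough sketch of the kind of loop I mean (purely illustrative; the shifting and masking is just one way I imagine touching each byte while it sits in a register):

#include <stdint.h>
#include <string.h>

int main(void)
{
    unsigned char buff[500] = {0};   /* initialized only so the sketch is well-defined */
    size_t i = 0;

    for (; i + 8 <= sizeof buff; i += 8)
    {
        uint64_t word, out = 0;
        memcpy(&word, buff + i, 8);                   /* one 8-byte load  */
        for (int b = 0; b < 8; b++)                   /* per-byte work stays in registers */
        {
            uint64_t byte = (word >> (8 * b)) & 0xFF;
            out |= ((byte + 1) & 0xFF) << (8 * b);
        }
        memcpy(buff + i, &out, 8);                    /* one 8-byte store */
    }

    for (; i < sizeof buff; i++)                      /* tail shorter than a quadword */
        buff[i]++;
}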

Ryan B.
  • The cache *will* handle the loads and stores, probably in 64-byte blocks. – Bo Persson Mar 13 '18 at 00:06
  • Yes, I know that after the first read the whole cache line will be filled and the rest of the accesses will be handled by the cache subsystem. So, do you mean that the additional operations required to deal with the bytes of a quadword will cost more than accessing the cache? – Andrew Bolotov Mar 13 '18 at 00:12
  • If you load 8 bytes into `rdx` there is no easy way to increment the bytes individually without risking overflow to the next byte. And anyway, the time it takes to really read and write to RAM will let the CPU perform hundreds of instructions while "waiting". – Bo Persson Mar 13 '18 at 00:19
  • I'm curious what you think of the potential performance problems of this if you actually compiled it as if performance *matters*. I.e. turning on optimizations `-O2` – WhozCraig Mar 13 '18 at 00:24
  • You might want to look at what gets generated with optimizations on: https://godbolt.org/g/QWvV7S . You can see the compiler used SIMD instructions to perform the operations. If you alter the length of the array from 500 to smaller numbers you can see how the code changes depending on the array length. The code was done this way to allow the code to be generated without optimizing the entire thing away – Michael Petch Mar 13 '18 at 01:17
  • *But this approach would dramatically reduce number of memory accessing that is the most expensive operation*: No, touching a cache line that you've already touched a few instructions earlier is cheap. It will still be hot in L1d cache. What's expensive are cache misses, or scattered memory access patterns. The asm you got from `-O0` is obviously horrible, and bottlenecks at one increment per ~6 cycles because it keeps the *loop counter* in memory. See https://stackoverflow.com/questions/49189685/adding-a-redundant-assignment-speeds-up-code-when-compiled-without-optimization/ – Peter Cordes Mar 13 '18 at 01:46

2 Answers


Reason why this shouldn't be done: Imagine if char happened to be unsigned (to make overflow have defined behavior) and you had a byte 0xFF followed (or preceded, depending on endianness) by 0x01.

Incrementing a byte at a time, you'd end up with the 0xFF becoming 0x00 and the 0x01 becoming 0x02. But if you just loaded 4 or 8 bytes at a time and added 0x01010101 (or the eight-byte equivalent) to achieve the same result, the carry out of the 0xFF would spill into the 0x01, so you'd end up with 0x00 and 0x03, not 0x00 and 0x02.

Similar issues would typically occur with signed char too; signed overflow and truncation rules (or lack thereof) make it more complicated, but the gist is that incrementing a byte at a time limits effects to that byte, with no cross-byte "interference".
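
A tiny demonstration of that carry problem (a made-up example, assuming a little-endian machine so the 0xFF ends up in the low byte of the loaded word):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char bytes[4] = { 0xFF, 0x01, 0x00, 0x00 };
    uint32_t word;

    memcpy(&word, bytes, 4);    /* word == 0x000001FF on little-endian      */
    word += 0x01010101u;        /* the 0xFF byte carries into its neighbour */
    memcpy(bytes, &word, 4);

    /* prints "00 03 01 01": the 0x01 byte became 0x03 instead of 0x02 */
    printf("%02X %02X %02X %02X\n", bytes[0], bytes[1], bytes[2], bytes[3]);
}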

ShadowRanger
  • Thanks for the explanation! But originally I thought not just about adding 0x01010101 to the loaded doubleword. Let's say that in eax you have 4 bytes that should be modified. I can put into ebx a mask like 0x000000FF which can be shifted by 8 bits to access each particular byte in eax, like edx = eax & ebx; ebx <<= 8. Then I can do whatever I want with the byte that is now stored in edx. To push the modified byte back I can use the inverted mask, like eax = (eax & 0xFFFFFF00) + edx – Andrew Bolotov Mar 13 '18 at 00:29
  • I'll note, on x86 chips, `PADDB` might actually allow for this sort of wrap-around-without-carry behavior; some compilers might choose to use it for all I know, but it's unlikely to gain much given it's likely to end up memory bound anyway. – ShadowRanger Mar 13 '18 at 00:30
  • @AndrewBolotov: Given CPU caches, replacing a few byte move instructions with a single quadword move and a bunch of additional bit manipulations is unlikely to gain much of anything. – ShadowRanger Mar 13 '18 at 00:32
  • @ShadowRanger: Separate byte-increment will not be memory-bound even if the data was cold in cache to start with. On a modern desktop, loading 8 bytes per core clock cycle is not unreasonable for sequential access. (And read/modify/write of 4 bytes per clock is not unreasonable either.) With a small array like this (only 500 bytes) that fits in L1d cache, `paddb` should give nearly a factor of 16 speedup. And compiling with optimization enabled instead of braindead `gcc -O0` code would give another factor of ~6 speedup. The store/reload of the loop counter is the bottleneck! – Peter Cordes Mar 13 '18 at 01:34
  • For a target architecture without SIMD byte addition, a good compiler might be able to use [SWAR (SIMD within a register)](https://en.wikipedia.org/wiki/SWAR). Maybe load a (32-bit) word and mask it with `0xFF00FF00` and the inverse, add `0x01010101` to both, mask again to zero the carry-out, and OR them back together. (Maybe a better algo is possible, but generating the carry-out from each byte is hard). That may only be a win with 64-bit words, or if ALU throughput is very good compared to store throughput (more lopsided than modern x86 with ~4 ALU ops per cycle vs. 1 store per cycle). – Peter Cordes Mar 13 '18 at 01:42
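
For reference, a rough sketch of the SWAR idea from the last comment (illustrative only, using 32-bit words; the two masks split the word into alternating bytes so that each addition's carry-out lands in a byte that gets masked away):

#include <stdint.h>

/* Increment every byte of a 32-bit word without cross-byte carries.
   This is just a sketch of the technique, not any compiler's actual output. */
static uint32_t inc_each_byte(uint32_t w)
{
    uint32_t even = (w & 0x00FF00FFu) + 0x01010101u;   /* bytes 0 and 2 */
    uint32_t odd  = (w & 0xFF00FF00u) + 0x01010101u;   /* bytes 1 and 3 */
    /* drop any carry that spilled into a neighbouring (zeroed) byte, then recombine */
    return (even & 0x00FF00FFu) | (odd & 0xFF00FF00u);
}

A driver loop would load a word from the buffer, call this, and store it back, with a plain byte loop for whatever tail doesn't fill a whole word.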

When you compile without optimization, the compiler does a more literal translation of the code to assembly. Part of the reason for this is so that, when you step through the code in a debugger, the steps correspond to your source code.

If you enable optimization then the assembly may look completely different.

Also, your program causes undefined behaviour by reading an uninitialized char.
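
For example, one minimal way to remove that issue is to give the bytes a definite value before the loop:

#include <string.h>

int main(void)
{
    char buff[500];
    memset(buff, 0, sizeof buff);   /* every byte now has a definite value */
    for (int i = 0; i < 500; i++)
        buff[i]++;
}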

M.M