Let's consider this very simple code:
int main(void)
{
    char buff[500];
    int i;
    for (i = 0; i < 500; i++)
    {
        (buff[i])++;
    }
}
So it just walks through 500 bytes and increments each one. This code was compiled with gcc for the x86-64 architecture and disassembled with the objdump -D utility. Looking at the disassembled code, I found that the data are transferred between memory and registers byte by byte (a movzbl instruction loads each byte from memory, and mov %dl stores it back):
00000000004004ed <main>:
4004ed: 55 push %rbp
4004ee: 48 89 e5 mov %rsp,%rbp
4004f1: 48 81 ec 88 01 00 00 sub $0x188,%rsp
4004f8: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
4004ff: eb 20 jmp 400521 <main+0x34>
400501: 8b 45 fc mov -0x4(%rbp),%eax
400504: 48 98 cltq
400506: 0f b6 84 05 00 fe ff movzbl -0x200(%rbp,%rax,1),%eax
40050d: ff
40050e: 8d 50 01 lea 0x1(%rax),%edx
400511: 8b 45 fc mov -0x4(%rbp),%eax
400514: 48 98 cltq
400516: 88 94 05 00 fe ff ff mov %dl,-0x200(%rbp,%rax,1)
40051d: 83 45 fc 01 addl $0x1,-0x4(%rbp)
400521: 81 7d fc f3 01 00 00 cmpl $0x1f3,-0x4(%rbp)
400528: 7e d7 jle 400501 <main+0x14>
40052a: c9 leaveq
40052b: c3 retq
40052c: 0f 1f 40 00 nopl 0x0(%rax)
This looks like it has performance implications, because the loop accesses memory 500 times to read and 500 times to write. I know the cache system will cope with it somehow, but still. My question is: why can't we load a quadword, do a couple of bit operations to increment each byte of that quadword, and then write it back to memory? Obviously this would require some additional logic to handle the final part of the data that is smaller than a quadword, and an extra register, but this approach would dramatically reduce the number of memory accesses, which are the most expensive operations. Probably I am missing some obstacle that inhibits such an optimization, so it would be great to get an explanation here.
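
For concreteness, here is a rough sketch of the kind of word-at-a-time update I have in mind. This is just my own illustration, not anything the compiler emits; the 0x7F7F.../0x8080... masking is one way to keep each per-byte +1 from carrying into the neighbouring byte, and inc_bytes_swar is a name I made up for it:

#include <stdint.h>
#include <string.h>

/* Sketch only: increment every byte of buf, eight bytes at a time.
 * Loads and stores go through memcpy to stay alignment- and
 * aliasing-safe; the masks stop a per-byte increment from carrying
 * into the next byte. */
void inc_bytes_swar(unsigned char *buf, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, buf + i, 8);                    /* load a quadword */
        uint64_t high = w & 0x8080808080808080ULL; /* save each byte's top bit */
        w = (w & 0x7F7F7F7F7F7F7F7FULL)
            + 0x0101010101010101ULL;               /* add 1 to the low 7 bits; cannot cross bytes */
        w ^= high;                                 /* fold the saved top bits back in */
        memcpy(buf + i, &w, 8);                    /* store the quadword back */
    }
    for (; i < n; i++)                             /* tail smaller than a quadword */
        buf[i]++;
}

With something like this, the 500-byte loop would need roughly 63 loads and 63 stores instead of 500 of each, which is the reduction I was describing above.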