
So I've been reading up on what goes on inside x86 processors for about half a year now, and I decided to try my hand at x86 assembly for fun, starting with only 80386 instructions to keep it simple. (I'm mostly trying to learn, not optimize.)

I have a game I made a few months ago, coded in C, so I went in and rewrote the bitmap blitting function from scratch in assembly. What I don't get is that the main pixel-plotting body of the loop is faster in the C code (which is 18 instructions) than in my assembly code (which is only 7 instructions, and I'm almost 100% certain it doesn't straddle cache line boundaries).

So my main question is: why do 18 instructions take less time than 7 instructions? At the bottom I have the two code snippets.

PS: Each color is 8-bit indexed. C code:

    {
        for (x = 0; x < src.w; x++)
00D35712  mov         dword ptr [x],0                       // Just initial loop setup
00D35719  jmp         Renderer_DrawBitmap+174h (0D35724h)   // Just initial loop setup
00D3571B  mov         eax,dword ptr [x]  
00D3571E  add         eax,1  
00D35721  mov         dword ptr [x],eax  
00D35724  mov         eax,dword ptr [x]  
00D35727  cmp         eax,dword ptr [ebp-28h]  
00D3572A  jge         Renderer_DrawBitmap+1BCh (0D3576Ch)  
        {
                *dest_pixel = renderer_trans[renderer_light[*src_pixel][light]][*dest_pixel][trans];
// Start of what I consider the body
00D3572C  mov         eax,dword ptr [src_pixel]  
00D3572F  movzx       ecx,byte ptr [eax]  
00D35732  mov         edx,dword ptr [light]  
00D35735  movzx       eax,byte ptr renderer_light (0EDA650h)[edx+ecx*8]  
00D3573D  shl         eax,0Bh  
00D35740  mov         ecx,dword ptr [dest_pixel]  
00D35743  movzx       edx,byte ptr [ecx]  
00D35746  lea         eax,renderer_trans (0E5A650h)[eax+edx*8]  
00D3574D  mov         ecx,dword ptr [dest_pixel]  
00D35750  mov         edx,dword ptr [trans]  
00D35753  mov         al,byte ptr [eax+edx]  
00D35756  mov         byte ptr [ecx],al  
            dest_pixel++;
00D35758  mov         eax,dword ptr [dest_pixel]  
00D3575B  add         eax,1  
00D3575E  mov         dword ptr [dest_pixel],eax  
            src_pixel++;
00D35761  mov         eax,dword ptr [src_pixel]  
00D35764  add         eax,1  
00D35767  mov         dword ptr [src_pixel],eax  
// End of what I consider the body
        }
00D3576A  jmp         Renderer_DrawBitmap+16Bh (0D3571Bh)  

And the assembly code I wrote: (esi is the source pixel, edi is the screen buffer, edx is the light level, ebx is the transparency level, and ecx is the width of this row)

drawing_loop:
00C55682  movzx       ax,byte ptr [esi]  
00C55686  mov         ah,byte ptr renderer_light (0DFA650h)[edx+eax*8]  
00C5568D  mov         al,byte ptr [edi]  
00C5568F  mov         al,byte ptr renderer_trans (0D7A650h)[ebx+eax*8]  
00C55696  mov         byte ptr [edi],al  

00C55698  inc         esi  
00C55699  inc         edi  
00C5569A  loop        drawing_loop (0C55682h)  
// This isn't just the body this is the full row plotting loop just like the code above there

And for context, the pixel is lit with a LUT, and the transparency is also done with a LUT. Pseudo-C code:

//transparencyLUT[new][old][transparency level (0 = opaque, 7 = full transparency)]
//lightLUT[color][light level (0 = white, 3 = no change, 7 = full black)]
dest_pixel = transparencyLUT[lightLUT[source_pixel][light]]
                            [screen_pixel]
                            [transparency];
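For concreteness, here is a minimal, self-contained C sketch of that double lookup. The table sizes and identity-style contents below are made up just to keep it runnable; the real `renderer_light` / `renderer_trans` tables are 256-entry arrays built by the game's renderer:

```c
#include <stdint.h>

/* Hypothetical, tiny stand-ins for the real tables:
   lightLUT[color][light] and transparencyLUT[new][old][trans].
   4-entry tables are used here only so the sketch is self-contained. */
static uint8_t lightLUT[4][8];
static uint8_t transparencyLUT[4][4][8];

static void blit_row(uint8_t *dest, const uint8_t *src, int w,
                     int light, int trans)
{
    for (int x = 0; x < w; x++) {
        uint8_t lit = lightLUT[src[x]][light];          /* light the source pixel */
        dest[x] = transparencyLUT[lit][dest[x]][trans]; /* blend with the screen pixel */
    }
}
```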

What gets me is how I use pretty much the same instructions the C code does, just fewer of them.

If you need more info I'll be happy to give more; I just don't want this to be a huge question. I'm genuinely curious because I'm somewhat new to x86 assembly programming and want to learn more about how our CPUs actually work.

My only guess is that the out-of-order execution engine doesn't like my code because it's all memory accesses moving data to the same register.

Napoleon
  • A few things. 1) Your version is going to suffer seriously from [partial register stalls](https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-partial-registers). 2) Instruction count is only an indirect way of estimating performance. Instructions only matter in how they affect other things, such as the frontend/decoder (instruction size/alignment) or the backend (uops, latency, throughput). If you're going to seriously look into micro-optimization, you might check out the [x86 wiki on SO](https://stackoverflow.com/tags/x86/info). PeterCordes has done an excellent job maintaining it. – Noah Sep 18 '21 at 21:33
  • A couple of similar questions that partially address this: one for older [Pentium](https://stackoverflow.com/questions/1099225/why-do-more-pentium-assembly-instructions-take-less-time) CPUs, one for [newer](https://stackoverflow.com/questions/58884297/why-does-re-initializing-a-register-inside-an-unrolled-add-loop-make-it-run-fast) CPUs. – 1201ProgramAlarm Sep 18 '21 at 21:34
  • Also, without benchmarks / numbers it's tough to debug performance issues. Frankly, I don't really know what I'm looking at in the first code block. A bit more clarity about what you're comparing would help. – Noah Sep 18 '21 at 21:38
  • As vonbrand notes, reusing the same register won't bother OOE that much, because of register renaming (provided you overwrite it completely - which you don't hence the partial register stalls). But generally, instruction count is not a good estimate of execution time. It's more things like: how many uops does each instruction actually take? Which of these instructions can OOE do in parallel? Where will it stall? Will memory accesses hit cache? Will conditional branches be predicted accurately? – Nate Eldredge Sep 18 '21 at 22:02
  • Another note is that the `loop` instruction is remarkably slow; see https://stackoverflow.com/questions/35742570/why-is-the-loop-instruction-slow-couldnt-intel-have-implemented-it-efficiently – Nate Eldredge Sep 18 '21 at 22:06
  • Thank you guys for being so helpful on my first question! If I ask another one I'll show more code for context. And I'll look into those other links you guys have. – Napoleon Sep 18 '21 at 22:40
  • `movzx ax,byte ptr [esi]` ouch, you almost avoided a false dependency on the previous iteration with `movzx eax, byte ptr [esi]`, but instead you only merged a new low 16 bits into the old EAX. See also [How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent](https://stackoverflow.com/q/45660139) if you're running this on a modern Intel CPU; AH-merging still requires a merging uop, and it seems to have to issue in a cycle by itself. – Peter Cordes Sep 19 '21 at 03:51
  • Yeah, just tons of unnecessary partial-register false dependencies / stalls from not using movzx, or not using a wide enough movzx. Is there some non-zero high part of EAX that you're merging into, to index a 256-byte or 64k aligned array without extra address calculation? – Peter Cordes Sep 19 '21 at 03:54
  • Note that if you're running low on registers, you could spill some regs that aren't used during this loop, freeing them up for temporaries. Although I don't think you need that, just `movzx` into 32-bit registers. And depending on your CPU, try `movzx eax` / `shl eax, 8` vs. mov into AH. Using `mov al, [mem]` to merge a low byte is fine on most CPUs (except old P6-family up to Nehalem) when you actually want that merging; that's something compilers are actually missing out on. (Or avoiding because it's a big perf pothole on still-existing Core2 / Nehalem CPUs). – Peter Cordes Sep 19 '21 at 04:06
  • Also, if you can just compute instead of look-up, you may gain some speed if you can compute 16 pixels in parallel using SSE2, or even just 4 if you need to unpack to float. IDK if the L2 cache misses from your lookup tables are hurting much (or would be if you weren't creating this false dependency), or if OoO exec can pipeline them. – Peter Cordes Sep 19 '21 at 04:08
  • [How to load a single byte from address in assembly](https://stackoverflow.com/q/20727379) is a better duplicate for the loading-a-byte specific detail, added that. Now I think the duplicates really do cover the two showstoppers pretty specifically. – Peter Cordes Sep 19 '21 at 23:03
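Pulling the advice in these comments together, an untested sketch of what the fixed loop could look like: zero-extending loads into full 32-bit registers, a register such as `ebp` freed up to hold the old destination pixel (as the follow-up comment under the answer mentions doing), and `dec`/`jnz` instead of the slow `loop` instruction. This mirrors the compiler's `shl eax, 0Bh` addressing scheme rather than the AH-merge; it is a sketch, not verified code:

```asm
drawing_loop:
    movzx   eax, byte ptr [esi]                       ; full-register write: no partial merge
    movzx   eax, byte ptr renderer_light[edx+eax*8]   ; lit = lightLUT[color][light]
    shl     eax, 0Bh                                  ; lit * 256 * 8 = row offset in renderer_trans
    movzx   ebp, byte ptr [edi]                       ; old = *dest_pixel (ebp repurposed)
    lea     eax, renderer_trans[eax+ebp*8]
    mov     al, byte ptr [eax+ebx]                    ; final byte load; merging into AL here is cheap
    mov     byte ptr [edi], al
    inc     esi
    inc     edi
    dec     ecx                                       ; replaces the slow LOOP instruction
    jnz     drawing_loop
```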

1 Answer


Not all instructions take the same time, and modern implementations of the CPU can execute (parts of) some instructions in parallel, as long as one doesn't read the data written by the previous one and the required units don't collide. Recent designs translate the "machine" instructions into lower-level, very simple ones that are scheduled on the fly to execute on the various units of the CPU in parallel as much as possible, using a whole lot of shadow registers: one instruction can be using the value in one copy of %eax (the old value) after another instruction writes a new value into another copy of %eax (the new value), thus decoupling instructions even more. The hoops they jump through for performance's sake...
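A rough C-level illustration of why dependency chains matter more than instruction count (the function names and the two-accumulator trick are illustrative, not from the question's code): summing through a single accumulator serializes every addition, while two independent accumulators give the out-of-order core two chains it can overlap. Both compute the same result.

```c
#include <stdint.h>

/* One accumulator: every add depends on the previous add's result,
   so the adds form a single serial dependency chain. */
uint64_t sum_serial(const uint32_t *a, int n)
{
    uint64_t s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Two independent accumulators: the out-of-order engine can run the
   two chains in parallel (more instructions, but potentially faster). */
uint64_t sum_paired(const uint32_t *a, int n)
{
    uint64_t s0 = 0, s1 = 0;
    int i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)
        s0 += a[i];   /* odd-length tail */
    return s0 + s1;
}
```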

vonbrand
  • I suspected it had to do with the OOE stuff, but I wasn't sure until now. So I did what you said and removed the dependency in there. I was out of registers, but I used ebp for something other than the stack. Now the last two lines before the `inc esi` / `inc edi` use edx instead of the same eax. – Napoleon Sep 18 '21 at 22:09
  • Note that only a few instructions are translated into more than one µop (micro-operation). Most important instructions actually correspond to exactly one µop, even somewhat complex ones. And these µops are anything but simple, as they hold an entire execution port configuration. – fuz Sep 19 '21 at 01:34
  • When you say "latest versions", that's pretty much every x86 CPU since ~1999, when Intel stopped making P5 / P5MMX which was superscalar in-order pipelined, but didn't split complex instructions into separate uops (so P5 was most efficient running a RISCier subset of x86, e.g. avoiding memory-destination add.) Literally all commercially available x86 CPUs, including low-power / embedded, have been out-of-order for a few years now, with mainstream CPUs starting to decode to uops as early as PPro (P6) in 1995. – Peter Cordes Sep 19 '21 at 23:02