
I am trying to make my own simple 3D rendering engine in assembly, totally from scratch. I run it from DOS, switch to 32-bit protected mode (PM), and all that stuff.

Finally I got transformations with projection and wire-frame rendering working, but I have run into a really trivial problem: after rendering my scene, the LFB (linear framebuffer) needs to be cleared so I can draw the next frame into it.

But using `rep stosd` or a simple `mov` loop is really slow, and my FPS literally drops from 60+ to 10.

I am using a high resolution of 1280x1024 pixels with 4 bytes per pixel, so I need to set 1280*1024 = 1310720 dwords = 5242880 bytes to zero, starting at address 0xFC000000.
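The clear currently looks roughly like this (a simplified sketch, using the numbers above):

mov edi, 0xFC000000   ;base of the linear framebuffer
mov ecx, 1280*1024    ;1310720 dwords = 5242880 bytes
xor eax, eax          ;value to store: 0
cld                   ;make stosd count upwards
rep stosd             ;store ECX dwords of EAX at [EDI]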

Is there any way to tell memory to erase itself instantly? (I want to keep that resolution)

  • You can find a comparison of different approaches [in this answer](https://stackoverflow.com/a/27944531/1305969). – zx485 May 15 '19 at 14:42
  • Note that a LFB residing on your graphics card but mapped into CPU memory address space may be significantly slower than host memory. Clearing it would normally be done locally on the GPU, which is not the approach you are taking. – Jester May 15 '19 at 14:45
  • @Jester So are you trying to tell me that it is possible only with the GPU? – Segy May 15 '19 at 14:55
  • If you have a VBE/AF driver for your card you could try using a rectangle fill. Otherwise maybe just reinitializing the same graphics mode could be an option. – Jester May 15 '19 at 15:05
  • I am running in PM where I am not able to use interrupts. No, I don't have a VBE/AF driver, but I will have a look at it. – Segy May 15 '19 at 15:11
  • VBE does provide a protected mode interface (but it's known to be buggy on actual hardware). – Jester May 15 '19 at 15:17
  • Is your video memory mapped WC (write-combining) or UC (uncacheable)? If it's WC, then `movnti` should allow streaming stores that do a whole burst transfer of 64 bytes over the PCIe bus (if you have an external GPU). Or if you can use XMM registers, `movntps` for 16 bytes per instruction. I thought `rep stos` would also be efficient on WC memory though, and you say that's slow. So maybe your video RAM is set as uncacheable, or I'm wrong about `rep stosd`. – Peter Cordes May 15 '19 at 15:22
  • There is no single answer that works everywhere; multiple factors, not just the video card, are involved. – old_timer May 15 '19 at 16:54
  • There's probably something else going on here. Even if you're using an ancient PCI video card you should be getting around 25 fps just erasing the screen using REP STOSD, and about 17 fps with combined erasing and rendering, assuming rendering a frame takes 17ms (60 fps). – Ross Ridge May 15 '19 at 17:07
  • re: your specific question: no, you can't tell RAM to erase itself, whether it's video RAM or normal CPU-connected RAM. Some CPUs have a special instruction to zero a full cache line, e.g. PowerPC does, but not standard x86. AMD has a `CLZERO` x86 instruction. But that's just 64 bytes at a time, and won't really speed things up vs. using `movnt` stores. To go much faster, you need to tell the GPU to zero video memory, especially if you have a discrete graphics card (not an iGPU sharing the same DRAM as the CPU cores). So you might have to write GPU drivers instead of just storing to mem. – Peter Cordes May 16 '19 at 11:20
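To illustrate the streaming-store idea from Peter Cordes's comment above, here is a minimal sketch of a clear loop using `movnti` (an SSE2 instruction), assuming the LFB is mapped WC and reusing the base address and size from the question:

mov edi, 0xFC000000   ;LFB base from the question
mov ecx, 1280*1024    ;number of dwords to clear
xor eax, eax
clear_loop:
movnti [edi], eax     ;non-temporal dword store, bypasses the cache
add edi, 4
dec ecx
jnz clear_loop
sfence                ;order the NT stores before later accesses to the LFB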

1 Answer


If you have only SSE available you can zero an XMM register with `xorps` and store 16 bytes at a time with `movntps` (the destination must be 16-byte aligned). If you have SSE2 available you can do the same 16 bytes at a time with `pxor` and `movdqa` (or the slower unaligned `movdqu`), though non-temporal stores are usually the better choice for a write-only framebuffer. With AVX you can zero a register with `vpxor xmm0, xmm0, xmm0` (the VEX encoding zero-extends to the full YMM/ZMM register) and store 32 bytes at a time with `vmovntps` on a YMM register; if AVX-512 is available, `vmovntps` with a ZMM register does 64 bytes at a time.
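For example, a clear loop over the framebuffer from the question with SSE non-temporal stores could look roughly like this (a sketch, assuming a 16-byte-aligned LFB whose size is a multiple of 64 bytes):

xorps xmm0, xmm0          ;xmm0 = all zeros
mov edi, 0xFC000000       ;LFB base from the question
mov ecx, (1280*1024*4)/64 ;number of 64-byte chunks
clear64:
movntps [edi], xmm0       ;four 16-byte non-temporal stores = 64 bytes
movntps [edi+16], xmm0
movntps [edi+32], xmm0
movntps [edi+48], xmm0
add edi, 64
dec ecx
jnz clear64
sfence                    ;order the NT stores before later accesses to the LFB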

If you want to use SSE instructions you'll need to set some control register bits first, otherwise executing them will fault (AVX and AVX-512 need additional setup, shown further below):

mov eax, cr0
and ax, 0xFFFB      ;clear coprocessor emulation CR0.EM
or ax, 0x2          ;set coprocessor monitoring  CR0.MP
mov cr0, eax
mov eax, cr4
or ax, 3 << 9       ;set CR4.OSFXSR and CR4.OSXMMEXCPT at the same time
mov cr4, eax
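That covers SSE. For AVX (and AVX-512) you additionally have to enable XSAVE and set the corresponding XCR0 state bits; a rough sketch (in real code, check CPUID for XSAVE/AVX/AVX-512 support first):

mov eax, cr4
or eax, 1 << 18     ;set CR4.OSXSAVE so XGETBV/XSETBV can be used
mov cr4, eax
xor ecx, ecx        ;select XCR0
xgetbv
or eax, 7           ;enable x87, SSE and AVX state (XCR0 bits 0-2)
;or eax, 7 << 5     ;additionally enable opmask/ZMM state for AVX-512 (bits 5-7)
xsetbv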
  • You're claiming that a regular loop using `mov [edi], eax` is faster than `rep stosd`? That's obviously not true for big arrays in write-back memory, and doubtful for video RAM. Intel since PPro (P6 uarch) has had "fast strings" microcode for `rep stos` and `rep movs`, internally using wider stores. If you can't use XMM registers (e.g. in kernel code), `rep movs` is typically the best choice on modern CPUs, especially Intel since IvyBridge which added the ERMSB feature. [Enhanced REP MOVSB for memcpy](//stackoverflow.com/q/43343231) – Peter Cordes May 16 '19 at 11:36
  • Example: the Linux kernel [uses `rep movsd`](https://elixir.bootlin.com/linux/v5.1.2/source/arch/x86/include/asm/string_32.h#L33) for its internal 32-bit memcpy implementation in `string_32.h`. If a loop was faster, it would use that. – Peter Cordes May 16 '19 at 11:40
  • `vzeroall` doesn't take an operand: it zeros *all* the xmm/ymm/zmm registers. For this purpose, it would be more efficient to `vpxor xmm0, xmm0, xmm0` to zero just X/Y/ZMM0. Also, SSE2 goes 128 *bits* at a time, not bytes. – Peter Cordes May 16 '19 at 11:42
  • I've made some corrections and added some clarification. @PeterCordes I've worked a lot with `rep`, it simply isn't fast; modern processors don't bother optimising for it because it's not something compilers really generate anymore. – 0x777C May 16 '19 at 11:45
  • `repne scasb` and `repe cmpsb` are [(very) slow](https://stackoverflow.com/questions/55563598/why-is-this-code-6-5x-slower), but `rep movs` and `rep stos` are fast (for large aligned blocks). They have significant startup overhead (until IceLake adds the "fast short rep" feature), but are far better than 32-bit `mov` for large memset or memcpy. Intel's optimization manual discusses the tradeoffs. You're completely wrong about processors not bothering to optimize for them: IvyBridge added the ERMSB feature and like I said IceLake will add a "fast short rep" feature. – Peter Cordes May 16 '19 at 11:49
  • re: your update: `rep stosd` is usually worse than a well-optimized SIMD loop, but it's much *better* than a scalar 32-bit `mov` loop. (nvm, you already deleted that). – Peter Cordes May 16 '19 at 11:52
  • If you have AVX, you need to use `vpxor xmm0,xmm0,xmm0` not `pxor xmm0,xmm0` to zero a YMM register. The legacy SSE encoding leaves the upper 128 bits unmodified. And for ZMM, you need AVX512 not just AVX. And `movntps` is available (and a good idea for this) starting with SSE1; you don't need AVX for it. – Peter Cordes May 16 '19 at 11:53
  • Also, to use SSE or AVX in a freestanding kernel, the OP will need to set a few bits in control registers, otherwise SSE and AVX instructions will fault. (This is how Intel avoids the problem of silent data corruption of the new architectural state for user-space code running on old OSes.) – Peter Cordes May 16 '19 at 11:56
  • @PeterCordes How is it now? – 0x777C May 16 '19 at 12:03
  • It's not totally wrong anymore, but you skip right from SSE to AVX512 without mentioning the widely-available AVX, and don't mention `movntps` for SSE1. Also, zeroing a ZMM register is still best done with a VEX-encoded `vpxor xmm0, xmm0, xmm0`, not a longer EVEX `vpxorq` unless its one of zmm16..31 ([Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm?](//stackoverflow.com/q/43713273)). But the OP is writing 32-bit code, so only zmm0..7 are available anyway. And of course AVX512 NT stores would be `vmovntps`. – Peter Cordes May 16 '19 at 12:11