
My project is compiled for 32-bit in both Windows and Linux. I have an 8-byte struct that's used just about everywhere:

struct Value {
    unsigned char type;
    union {  // 4 bytes
        unsigned long ref;
        float num;
    };
};

In a lot of places I need to zero out the struct, which is done like so:

#define NULL_VALUE_LITERAL {0, {0L}}
static const Value NULL_VALUE = NULL_VALUE_LITERAL;

// example of clearing a value
var = NULL_VALUE;

This, however, does not compile to the most efficient code in Visual Studio 2013, even with all optimizations on. What I see in the assembly is that the memory location for NULL_VALUE is read and then written to var, resulting in two reads from memory and two writes to memory. This clearing happens a lot, however, even in time-sensitive routines, and I'm looking to optimize.

If I set the value to NULL_VALUE_LITERAL, it's worse. The literal data, which again is all zeroes, is copied into a temporary stack value and THEN copied to the variable--even if the variable is also on the stack. So that's absurd.

There's also a common situation like this:

*pd->v1 = NULL_VALUE;

It has similar assembly code to the var=NULL_VALUE above, but it's something I can't optimize with inline assembly should I choose to go that route.

From my research the very, very fastest way to clear the memory would be something like this:

xor eax, eax
mov byte ptr [var], al
mov dword ptr [var+4], eax

Or better still, since the struct's alignment means the 3 bytes after the type field are just padding:

xor eax, eax
mov dword ptr [var], eax
mov dword ptr [var+4], eax

Can you think of any way I can get code similar to that, optimized to avoid the memory reads that are totally unnecessary?

I tried some other methods, which end up creating what I feel is overly bloated code that writes a 32-bit 0 literal to the two addresses, but IIRC writing a literal to memory still isn't as fast as writing a register to memory. I'm looking to eke out any extra performance I can get.

Ideally I would also like the result to be highly readable. Your help is appreciated.

Lummox JR

  • "*writing a literal to memory still isn't as fast as writing a register to memory*" - not true. And have you tried `memset`? – rustyx Nov 19 '17 at 20:19
  • what happens with `*((long long*) &var) = 0LL;` ? – xs0 Nov 19 '17 at 20:19
  • @xs0 - This could cause problems with alignment, depending on platform and method of allocation... – immortal Nov 19 '17 at 20:20
  • @xs0 - undefined behavior... – rustyx Nov 19 '17 at 20:20
  • Have you tried `= Value{}`? – Nov 19 '17 at 20:21
  • aren't we on x86? – xs0 Nov 19 '17 at 20:22
  • This is heavily dependent on compiler, but in your case, VS, you should be able to use `__declspec(align(32))` to declare and enforce that your value is 32 bit aligned. In this case, I think the compiler should be able to better optimize your code... https://msdn.microsoft.com/en-us/library/83ythb65.aspx – immortal Nov 19 '17 at 20:26
  • It's hard to optimize C code when the compiler (MSVC) is as dumb as a bag of bricks. – fuz Nov 19 '17 at 20:30
  • Can't you simply use NULL_VALUE_LITERAL instead of NULL_VALUE? Or you can just add a `setZero` function. Or maybe you can make NULL_VALUE a special type, and have an operator= which has an overload for this type (and this overload puts zeros into Value) – geza Nov 19 '17 at 20:31
  • @RustyX, I think memset would be a very poor fit since this is just an 8-byte struct. – Lummox JR Nov 19 '17 at 20:39
  • @geza, as I mentioned above the literal value doesn't copy the way I'd want it to; it copies to a temporary variable on the stack and then to the destination, which is worse. I'm not sure I want to risk the compiler failing to inline a function for this purpose either. – Lummox JR Nov 19 '17 at 20:43
  • And what about my last recommendation? – geza Nov 19 '17 at 20:45
  • Can't you just use a smarter compiler for your Windows build? e.g. gcc or clang? This question seems to be all about working around a compiler missed-optimization. – Peter Cordes Nov 19 '17 at 20:45
  • For many reasons I'm really not interested in changing up the build process that drastically, so VS2013 it is. – Lummox JR Nov 19 '17 at 20:48
  • @geza, an operator= function would have to be inlined to be more efficient, and I can't trust the compiler to inline--at least not in debug mode. Although I understand that the un-optimized debug build (the compiler freaks out when I turn optimizations on for it) won't be the same as release, I'd like the code to be at least as performant in debug mode as it is currently. An inline function might speed things up in release and slow them down in debug. Although I guess there's macros for that. – Lummox JR Nov 19 '17 at 20:51
  • @LummoxJR: use __forceinline then. – geza Nov 19 '17 at 20:52
  • btw, if you change `type` to `int`, then initializing from NULL_VALUE_LITERAL becomes two `mov .., 0` instructions. – geza Nov 19 '17 at 20:57
  • @LummoxJR: Why do you care what the compiler does in debug mode? Do you have some kind of minimum perf requirement that you need even when debugging? – Peter Cordes Nov 19 '17 at 21:23
  • @immortal, `__declspec(align(32))` requests 32 *byte* alignment. – prl Nov 20 '17 at 03:15
  • @geza, I did some testing with __forceinline after changing the debug build to allow it. Looks like the *pd->v1 case is not handled gracefully; the pointer is stored in a temp var on the stack. – Lummox JR Nov 20 '17 at 04:50
  • @PeterCordes, the reason I want to maintain decent performance in debug mode is that this is a game engine. If debug mode got too slow it'd hinder bug testing. – Lummox JR Nov 20 '17 at 04:50
  • @LummoxJR: well you might not be able to have it both ways. If you're using a compiler with as weak an optimizer as MSVC, there are only limited ways to get good optimized code, and that might be worse for debug mode. Can't you build with debug symbols and light optimizations for play-testing? Like `-O1` or something? – Peter Cordes Nov 20 '17 at 05:09
  • BTW, `-O1` isn't enough to inline `memset`, unfortunately. With no options or `-O1`, as you can try on Godbolt, you still do get an actual call to `memset`. But `-O1` might overall make everything else faster so those memset calls aren't a problem. I suggest you try it and see if it's a perf problem. It probably won't be, unless somehow the rest of your code is really well optimized for debug mode and you use this inside tight loops all over the place. Also, the library memset implementation is hopefully not terrible for 8-byte calls. – Peter Cordes Nov 20 '17 at 05:18
  • The best solution I can find appears to be using a macro to copy 0 into the value as if it's a quad pointer. I did end up trying out __forceinline (and its gcc equivalent) for a different routine related to this, where I think the compiler wasn't making a good enough decision on its own. – Lummox JR Nov 21 '17 at 17:26

1 Answer


I'd recommend `uint32_t` or `unsigned int` for the union with `float`. `long` on Linux x86-64 is a 64-bit type, which is probably not what you want.
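
For example, a minimal sketch of the struct with a fixed-width union member (assuming C++11's <cstdint> is available):

#include <cstdint>

struct Value {
    unsigned char type;
    union {                   // exactly 4 bytes on both platforms
        std::uint32_t ref;    // fixed width, unlike unsigned long
        float num;
    };
};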


I can reproduce the missed-optimization with MSVC CL19 -Ox on the Godbolt compiler explorer for x86-32 and x86-64. Workarounds that work with CL19:

  • make `type` an `unsigned int` instead of `unsigned char`, so there's no padding in the struct, then assign from a literal `{0, {0L}}` instead of a static const `Value` object. (Then you get two mov-immediate stores: `mov DWORD PTR [eax], 0` / `mov DWORD PTR [eax+4], 0`.) There's a sketch of this after the memset example below.

    gcc also has struct-zeroing missed-optimizations with padding in structs, but not as bad as MSVC (Bug 82142). It just defeats merging into wider stores; it doesn't get gcc to create an object on the stack and copy from that.

  • `std::memset`: probably the best option; MSVC compiles it to a single 64-bit store using SSE2: `xorps xmm0, xmm0` / `movq QWORD PTR [mem], xmm0`. (gcc -m32 -O3 compiles this memset to two mov-immediate stores.)

#include <cstring>

void arg_memset(Value *vp) {
    // zeroes all 8 bytes, including the padding after type
    std::memset(vp, 0, sizeof(*vp));
}

   ;; x86 (32-bit) MSVC -Ox
    mov      eax, DWORD PTR _vp$[esp-4]
    xorps    xmm0, xmm0
    movq     QWORD PTR [eax], xmm0
    ret      0

This is what I'd choose for modern CPUs (Intel and AMD). The penalty for crossing a cache-line is low enough that it's worth saving an instruction if it doesn't happen all the time. xor-zeroing is extremely cheap (especially on Intel SnB-family).
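
As promised above, a minimal sketch of the first workaround (assuming you can afford to widen type from unsigned char to unsigned int):

struct Value {
    unsigned int type;    // widened so there are no padding bytes
    union {               // still 4 bytes
        unsigned long ref;
        float num;
    };
};

void clear(Value *vp) {
    *vp = Value{0, {0L}};   // CL19 -Ox: two mov-immediate stores, no temporary object
}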


IIRC writing a literal to memory still isn't as fast as writing a register to memory

In asm, constants embedded in the instruction are called immediate data. mov-immediate to memory is mostly fine on x86, but it's a bit bloated for code-size.

(x86-64 only): A store with a RIP-relative addressing mode and an immediate can't micro-fuse on Intel CPUs, so it's 2 fused-domain uops. (See Agner Fog's microarch pdf, and other links in the tag wiki.) This means it's worth it (for front-end bandwidth) to zero a register if you're doing more than one store to a RIP-relative address. Other addressing modes do fuse, though, so it's just a code-size issue.

Related: Micro fusion and addressing modes (indexed addressing modes un-laminate on Sandybridge/Ivybridge, but Haswell and later can keep indexed stores micro-fused.) This isn't dependent on immediate vs. register source.


I think memset would be a very poor fit since this is just an 8-byte struct.

Modern compilers know what some heavily-used / important standard library functions do (`memset`, `memcpy`, etc.), and treat them like intrinsics. There's very little difference as far as optimization is concerned between `a = b` and `memcpy(&a, &b, sizeof(a))` if they have the same type.
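
To illustrate the equivalence (a sketch; Value is the 8-byte struct from the question):

#include <cstring>

void copy_assign(Value *dst, const Value *src) {
    *dst = *src;                          // plain struct assignment
}

void copy_memcpy(Value *dst, const Value *src) {
    std::memcpy(dst, src, sizeof(*dst));  // recognized as an intrinsic, not a call
}

// With optimizations on, both should compile to the same 8-byte copy.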

You might get a function call to the actual library implementation in debug mode, but debug mode is very slow anyway. If you have debug-mode perf requirements, that's unusual. (But does happen for code that needs to keep up with something else...)

Peter Cordes
  • Internally the union is defined properly as a definite four-byte value, so there's no worry there. Changing the type value to 4 bytes occurred to me but I suspect in a project this size and complexity it'd break something. As for the memset option, is it actually safe to use a qword pointer with SSE2 if it's not necessarily aligned to 8 bytes? – Lummox JR Nov 20 '17 at 04:26
  • @LummoxJR: Yes; if it wasn't the compiler wouldn't make this optimization. Only 16-byte and larger loads/stores ever have alignment requirements on x86 with "regular" load/store instructions like `movq`. That's why there aren't separate `movqa` and `movqu` instructions (like `movdqu`), just `movq`. That's why I mentioned possible performance issues from stores crossing a cache-line boundary if one of these objects is only 4-byte aligned and is split across a cache line. (naturally-aligned accesses can never cross a boundary larger than their own size). – Peter Cordes Nov 20 '17 at 05:12