I'm trying to code a few high-performance assembly functions as an exercise, and have encountered a weird segfault that happens when running the program normally, but not under valgrind or nemiver.
Basically, a cmov that shouldn't be executed, because its condition is always false, still segfaults when its memory operand is an out-of-bounds address.
I have a fast and a slow version. The slow one works all the time. The fast one works unless it receives a non-ASCII char, at which point it crashes horribly, unless I'm running it under gdb or nemiver.
ascii_flags is simply a 128-byte array (with a bit of room at the end) containing flags for every ASCII character (alpha, numeric, printable, etc.).
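In C terms, the table looks something like this (only the printable bit's value, 0b00001000, is the real one, per the code below; the other flag values are illustrative):

    /* Sketch of the table layout described above. Only flag_print's value
       (0x08, i.e. 0b00001000) is the real one; the other flags are made up. */
    enum {
        flag_alpha = 0x01,
        flag_num   = 0x02,
        flag_print = 0x08,
    };
    static const unsigned char ascii_flags[129] = {
        [' '] = flag_print,
        ['0'] = flag_print | flag_num,
        ['a'] = flag_print | flag_alpha,
        /* ...one flags byte per ASCII code 0..127... */
        [128] = 0,                       /* pad byte: the "room at the end" */
    };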
this works:
ft_isprint:
    xor     EAX, EAX                     ; zero the return value
    test    EDI, ~127                    ; check for non-ASCII (>127) input
    jnz     .error                       ; bail out with EAX = 0
    mov     EAX, [ascii_flags + EDI]     ; load from the flag table if the input fits
    and     EAX, 0b00001000              ; isolate the specific bit
.error:
    ret
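In C, this is roughly (using the illustrative names from the sketch above):

    int ft_isprint(int c)
    {
        if (c & ~127)                    /* any bit outside 0..127 set: not ASCII */
            return 0;
        return ascii_flags[c] & flag_print;
    }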
but this doesn't:
ft_isprint:
    xor     EAX, EAX                     ; zero the return value
    test    EDI, ~127                    ; check for non-ASCII (>127) input
    cmovz   EAX, [ascii_flags + EDI]     ; load from the flag table if the input fits
    and     EAX, flag_print              ; isolate the specific bit
    ret
Valgrind does actually crash too, but it reports nothing more than memory addresses, since I haven't managed to get any further debugging information out of it.
Edit:
I've written three versions of the function to take the wonderful answers into account.
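The key point, as I understand it: a cmovcc with a memory source always performs the load; only the write to the destination register depends on the flags. So my broken version behaved roughly like this C model (illustrative, not real code), which may also explain why it seemed to work under a debugger: with address-space randomization disabled, the bogus address can happen to fall in a mapped page.

    int broken_ft_isprint(int c)
    {
        int tmp = ascii_flags[c];        /* the load is unconditional: faults for c > 127 */
        int eax = 0;
        if (!(c & ~127))
            eax = tmp;                   /* only this assignment is conditional */
        return eax & flag_print;
    }

With that in mind, here are the three versions: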
ft_isprint:
    mov     RAX, 128                     ; load the fallback index (pad byte past the table)
    test    RDI, ~127                    ; check for non-ASCII (>127) input
    cmovz   RAX, RDI                     ; if the input is ASCII, use it as the index
    mov     AL, byte [ascii_flags + RAX] ; load the table entry into the least significant byte
    and     RAX, flag_print              ; isolate the specific bit (and zero the rest of RAX)
    ret
ft_isprint_branch:
    test    RDI, ~127                    ; check for non-ASCII (>127) input
    jnz     .out_of_bounds               ; if non-ASCII, jump to error handling
    mov     AL, byte [ascii_flags + RDI] ; load the table entry into the least significant byte
    and     RAX, flag_print              ; isolate the specific bit (and zero the rest of RAX)
    ret
.out_of_bounds:
    xor     RAX, RAX                     ; zero the return value
    ret
ft_isprint_compact:
    xor     RAX, RAX                     ; zero the return value preemptively
    test    RDI, ~127                    ; check for non-ASCII (>127) input
    jnz     .out_of_bounds               ; if non-ASCII, skip the table load
    mov     AL, byte [ascii_flags + RDI] ; load the table entry into the least significant byte
    and     RAX, flag_print              ; isolate the specific bit
.out_of_bounds:
    ret
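In C terms, the cmov version trades the branch for an index clamp, relying on the pad byte after the table:

    int ft_isprint(unsigned int c)
    {
        unsigned int idx = (c <= 127) ? c : 128;  /* out-of-range inputs hit the zero pad byte */
        return ascii_flags[idx] & flag_print;
    }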
After extensive testing, the branching functions are definitely faster than the cmov function, by about 5-15% on all types of data. The difference between the compact and non-compact versions is, as expected, minimal: the compact one is ever so slightly faster on a predictable data set, while the non-compact one is just as slightly faster on unpredictable data.
I tried various ways to skip the 'xor EAX, EAX' instruction, but couldn't find one that works.
Edit: after more testing, I've updated the code to three new versions:
ft_isprint_compact:
    sub     EDI, 32                      ; subtract 32, so any value < ' ' wraps around (unsigned)
    xor     EAX, EAX                     ; set the return value to 0
    cmp     EDI, 94                      ; check if input <= '~' - 32
    setbe   AL                           ; if so, set the return value to 1
    ret
ft_isprint_branch:
    xor     EAX, EAX                     ; set the return value to 0
    cmp     EDI, 127                     ; check for non-ASCII (>127) input
    ja      .out_of_bounds               ; if non-ASCII, skip the table load
    mov     AL, byte [ascii_flags + EDI] ; load the table entry into the least significant byte
.out_of_bounds:
    ret
ft_isprint:
    mov     EAX, 128                     ; load the fallback index (pad byte past the table)
    cmp     EDI, EAX                     ; check if the input is ASCII
    cmovae  EDI, EAX                     ; replace it with 128 if outside 0..127
                                         ; (cmov writes EDI either way, zero-extending into RDI)
    ; movzx EAX, byte [ascii_flags + RDI] ; alternative to the two following instructions if the masking is removed
    mov     AL, byte [ascii_flags + RDI] ; load the table entry into the least significant byte
    and     EAX, flag_print              ; isolate the specific bit and zero the rest of EAX
    ret
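The no-table version is the classic unsigned range check: subtracting 32 makes anything below ' ' wrap around to a huge unsigned value, so one unsigned comparison covers both bounds. In C:

    int ft_isprint_compact(unsigned int c)
    {
        /* c - 32 wraps for c < 32, so a single unsigned compare
           checks 32 <= c <= 126 ('~' - ' ' == 94) */
        return (c - 32) <= 94;
    }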
The performances are as follows, in microseconds. The 1-2-3 prefixes show the order of execution, to rule out a caching advantage:
-O3 a.out:

1 cond 153185, 2 branch 238341, 3 no_table 145436
1 cond 148928, 3 branch 248954, 2 no_table 116629
2 cond 149599, 1 branch 226222, 3 no_table 117428
2 cond 117258, 3 branch 241118, 1 no_table 147053
3 cond 117635, 1 branch 228209, 2 no_table 147263
3 cond 146212, 2 branch 220900, 1 no_table 147377

-O3 main.c:

1 cond 132964, 2 branch 157963, 3 no_table 131826
1 cond 133697, 3 branch 159629, 2 no_table 105961
2 cond 133825, 1 branch 139360, 3 no_table 108185
2 cond 113039, 3 branch 162261, 1 no_table 142454
3 cond 106407, 1 branch 133979, 2 no_table 137602
3 cond 134306, 2 branch 148205, 1 no_table 141934

-O0 a.out:

1 cond 255904, 2 branch 320505, 3 no_table 257241
1 cond 262288, 3 branch 325310, 2 no_table 249576
2 cond 247948, 1 branch 340220, 3 no_table 250163
2 cond 256020, 3 branch 415632, 1 no_table 256492
3 cond 250690, 1 branch 316983, 2 no_table 257726
3 cond 249331, 2 branch 325226, 1 no_table 250227

-O0 main.c:

1 cond 225019, 2 branch 224297, 3 no_table 229554
1 cond 235607, 3 branch 199806, 2 no_table 226286
2 cond 226739, 1 branch 210179, 3 no_table 238690
2 cond 237532, 3 branch 223877, 1 no_table 234103
3 cond 225485, 1 branch 201246, 2 no_table 230591
3 cond 228824, 2 branch 202015, 1 no_table 226788
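For context, each number is the time for a tight loop over a large input buffer, roughly like this hypothetical harness (my actual main.c differs in the details):

    #include <stddef.h>
    #include <sys/time.h>

    extern int ft_isprint(int c);            /* version under test */

    /* Hypothetical harness: returns elapsed microseconds for len calls. */
    static long bench(const unsigned char *buf, size_t len)
    {
        struct timeval start, end;
        volatile int sink = 0;               /* keep calls from being optimized out */

        gettimeofday(&start, NULL);
        for (size_t i = 0; i < len; i++)
            sink += ft_isprint(buf[i]);
        gettimeofday(&end, NULL);
        return (end.tv_sec - start.tv_sec) * 1000000L
             + (end.tv_usec - start.tv_usec);
    }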
The no-table version is about as fast as the cmov one, but doesn't allow for easily implementable locales. The branching algorithm is worse, except on predictable data at zero optimization? I've got no explanation for that.
I'll keep the cmov version, which is both the most elegant and the easiest to update. Thanks for all the help.