I'm trying to code a few high-performance assembly functions as an exercise, and have encountered a weird segfault that happens when running the program normally, but not under valgrind or nemiver.
Basically, a cmov that shouldn't be executed, because its condition is always false, still segfaults when its memory operand is an out-of-bounds address.
I have a fast and a slow version. The slow one works all the time. The fast one works unless it receives a non-ASCII char, at which point it crashes horribly, unless I'm running it under gdb or nemiver.
ascii_flags is simply a 128-byte array (with a bit of room at the end) containing flags for every ASCII character (alpha, numeric, printable, etc.).
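In C terms, the table looks something like this (only the printable bit's value, 0b00001000, is the real one, per the code below; the other flag values are illustrative):

    /* Sketch of the table layout described above. Only flag_print's value
       (0x08, i.e. 0b00001000) is the real one; the other flags are made up. */
    enum {
        flag_alpha = 0x01,
        flag_num   = 0x02,
        flag_print = 0x08,
    };
    static const unsigned char ascii_flags[129] = {
        [' '] = flag_print,
        ['0'] = flag_print | flag_num,
        ['a'] = flag_print | flag_alpha,
        /* ...one flags byte per ASCII code 0..127... */
        [128] = 0,                       /* pad byte: the "room at the end" */
    };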
this works:
ft_isprint:
    xor     EAX, EAX                     ; zero the return value
    test    EDI, ~127                    ; check for non-ASCII (>127) input
    jnz     .error                       ; bail out with EAX = 0
    mov     EAX, [ascii_flags + EDI]     ; load from the flag table if the input fits
    and     EAX, 0b00001000              ; isolate the specific bit
.error:
    ret
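In C, this is roughly (using the illustrative names from the sketch above):

    int ft_isprint(int c)
    {
        if (c & ~127)                    /* any bit outside 0..127 set: not ASCII */
            return 0;
        return ascii_flags[c] & flag_print;
    }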
but this doesn't:
ft_isprint:
    xor     EAX, EAX                     ; zero the return value
    test    EDI, ~127                    ; check for non-ASCII (>127) input
    cmovz   EAX, [ascii_flags + EDI]     ; load from the flag table if the input fits
    and     EAX, flag_print              ; isolate the specific bit
    ret
Valgrind does actually crash too, but it reports nothing more than memory addresses, since I haven't managed to get any further debugging information out of it.
Edit:
I've written three versions of the function to take the wonderful answers into account.
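The key point, as I understand it: a cmovcc with a memory source always performs the load; only the write to the destination register depends on the flags. So my broken version behaved roughly like this C model (illustrative, not real code), which may also explain why it seemed to work under a debugger: with address-space randomization disabled, the bogus address can happen to fall in a mapped page.

    int broken_ft_isprint(int c)
    {
        int tmp = ascii_flags[c];        /* the load is unconditional: faults for c > 127 */
        int eax = 0;
        if (!(c & ~127))
            eax = tmp;                   /* only this assignment is conditional */
        return eax & flag_print;
    }

With that in mind, here are the three versions: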
ft_isprint:
    mov     RAX, 128                     ; load the fallback index (pad byte past the table)
    test    RDI, ~127                    ; check for non-ASCII (>127) input
    cmovz   RAX, RDI                     ; if the input is ASCII, use it as the index
    mov     AL, byte [ascii_flags + RAX] ; load the table entry into the least significant byte
    and     RAX, flag_print              ; isolate the specific bit (and zero the rest of RAX)
    ret
ft_isprint_branch:
    test    RDI, ~127                    ; check for non-ASCII (>127) input
    jnz     .out_of_bounds               ; if non-ASCII, jump to error handling
    mov     AL, byte [ascii_flags + RDI] ; load the table entry into the least significant byte
    and     RAX, flag_print              ; isolate the specific bit (and zero the rest of RAX)
    ret
.out_of_bounds:
    xor     RAX, RAX                     ; zero the return value
    ret
ft_isprint_compact:
    xor     RAX, RAX                     ; zero the return value preemptively
    test    RDI, ~127                    ; check for non-ASCII (>127) input
    jnz     .out_of_bounds               ; if non-ASCII, skip the table load
    mov     AL, byte [ascii_flags + RDI] ; load the table entry into the least significant byte
    and     RAX, flag_print              ; isolate the specific bit
.out_of_bounds:
    ret
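In C terms, the cmov version trades the branch for an index clamp, relying on the pad byte after the table:

    int ft_isprint(unsigned int c)
    {
        unsigned int idx = (c <= 127) ? c : 128;  /* out-of-range inputs hit the zero pad byte */
        return ascii_flags[idx] & flag_print;
    }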
After extensive testing, the branching functions are definitely faster than the cmov function, by about 5-15% on all types of data. The difference between the compact and non-compact versions is, as expected, minimal: the compact one is ever so slightly faster on a predictable data set, while the non-compact one is just as slightly faster on unpredictable data.
I tried various ways to skip the 'xor EAX, EAX' instruction, but couldn't find one that works.
Edit: after more testing, I've updated the code to three new versions:
ft_isprint_compact:
    sub     EDI, 32                      ; subtract 32, so any value < ' ' wraps around (unsigned)
    xor     EAX, EAX                     ; set the return value to 0
    cmp     EDI, 94                      ; check if input <= '~' - 32
    setbe   AL                           ; if so, set the return value to 1
    ret
ft_isprint_branch:
    xor     EAX, EAX                     ; set the return value to 0
    cmp     EDI, 127                     ; check for non-ASCII (>127) input
    ja      .out_of_bounds               ; if non-ASCII, skip the table load
    mov     AL, byte [ascii_flags + EDI] ; load the table entry into the least significant byte
.out_of_bounds:
    ret
ft_isprint:
    mov     EAX, 128                     ; load the fallback index (pad byte past the table)
    cmp     EDI, EAX                     ; check if the input is ASCII
    cmovae  EDI, EAX                     ; replace it with 128 if outside 0..127
                                         ; (cmov writes EDI either way, zero-extending into RDI)
    ; movzx EAX, byte [ascii_flags + RDI] ; alternative to the two following instructions if the masking is removed
    mov     AL, byte [ascii_flags + RDI] ; load the table entry into the least significant byte
    and     EAX, flag_print              ; isolate the specific bit and zero the rest of EAX
    ret
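The no-table version is the classic unsigned range check: subtracting 32 makes anything below ' ' wrap around to a huge unsigned value, so one unsigned comparison covers both bounds. In C:

    int ft_isprint_compact(unsigned int c)
    {
        /* c - 32 wraps for c < 32, so a single unsigned compare
           checks 32 <= c <= 126 ('~' - ' ' == 94) */
        return (c - 32) <= 94;
    }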
The performances are as follows, in microseconds. The 1-2-3 prefixes show the order of execution, to rule out a caching advantage:
-O3 a.out:

1 cond 153185, 2 branch 238341, 3 no_table 145436
1 cond 148928, 3 branch 248954, 2 no_table 116629
2 cond 149599, 1 branch 226222, 3 no_table 117428
2 cond 117258, 3 branch 241118, 1 no_table 147053
3 cond 117635, 1 branch 228209, 2 no_table 147263
3 cond 146212, 2 branch 220900, 1 no_table 147377

-O3 main.c:

1 cond 132964, 2 branch 157963, 3 no_table 131826
1 cond 133697, 3 branch 159629, 2 no_table 105961
2 cond 133825, 1 branch 139360, 3 no_table 108185
2 cond 113039, 3 branch 162261, 1 no_table 142454
3 cond 106407, 1 branch 133979, 2 no_table 137602
3 cond 134306, 2 branch 148205, 1 no_table 141934

-O0 a.out:

1 cond 255904, 2 branch 320505, 3 no_table 257241
1 cond 262288, 3 branch 325310, 2 no_table 249576
2 cond 247948, 1 branch 340220, 3 no_table 250163
2 cond 256020, 3 branch 415632, 1 no_table 256492
3 cond 250690, 1 branch 316983, 2 no_table 257726
3 cond 249331, 2 branch 325226, 1 no_table 250227

-O0 main.c:

1 cond 225019, 2 branch 224297, 3 no_table 229554
1 cond 235607, 3 branch 199806, 2 no_table 226286
2 cond 226739, 1 branch 210179, 3 no_table 238690
2 cond 237532, 3 branch 223877, 1 no_table 234103
3 cond 225485, 1 branch 201246, 2 no_table 230591
3 cond 228824, 2 branch 202015, 1 no_table 226788
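For context, each number is the time for a tight loop over a large input buffer, roughly like this hypothetical harness (my actual main.c differs in the details):

    #include <stddef.h>
    #include <sys/time.h>

    extern int ft_isprint(int c);            /* version under test */

    /* Hypothetical harness: returns elapsed microseconds for len calls. */
    static long bench(const unsigned char *buf, size_t len)
    {
        struct timeval start, end;
        volatile int sink = 0;               /* keep calls from being optimized out */

        gettimeofday(&start, NULL);
        for (size_t i = 0; i < len; i++)
            sink += ft_isprint(buf[i]);
        gettimeofday(&end, NULL);
        return (end.tv_sec - start.tv_sec) * 1000000L
             + (end.tv_usec - start.tv_usec);
    }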
The no-table version is about as fast as the cmov one, but doesn't allow for easily implementable locales. The branching algorithm is worse, except on predictable data at zero optimization? I've got no explanation for that.
I'll keep the cmov version, which is both the most elegant and the easiest to update. Thanks for all the help.