I keep seeing these palindrome questions with clunky beginner code, which got me interested in writing a version that's actually efficient. My byte-at-a-time loop should run at one iteration per clock on Intel Haswell, but slower on earlier Intel CPUs, where limited macro-fusion for back-to-back cmp/jcc pairs pushes the loop over 4 fused-domain uops.
Also see below for a way to make it case-insensitive.
Checking more than one byte at a time works, even with overlap at the crossover in the middle:
abcdcba
abcd
   dcba   ; overlap by one
 bcdc
  cdcb    ; overlap by two
Load some bytes, reverse the order with bswap, Silvermont/Haswell movbe, SSSE3 pshufb, or even rol ax,8 / rol eax,16 / rol ax,8. Then compare with the same number of bytes where there should be a match.
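In case the rol version isn't obvious, this is what the three-rotate sequence does to the bytes of EAX (purely illustrative; bswap eax is normally the better choice where it's available):

    ; EAX = [a b c d], with a in the most-significant byte
    rol  ax, 8      ; [a b c d] -> [a b d c]   swap the low two bytes
    rol  eax, 16    ; [a b d c] -> [d c a b]   swap the 16-bit halves
    rol  ax, 8      ; [d c a b] -> [d c b a]   swap the (new) low two bytes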
Add in printing of results however you like. I just return an exit status.
All of this works the same in 32-bit code, but I used 64-bit because the register-arg calling convention avoids cluttering the code with stack manipulation.
Note the use of local labels (.label) to avoid conflicts between similar blocks of code that use the same label names.
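(In case local labels are new to you: a .label attaches to the most recent preceding non-local label, so the two .loop labels in this made-up fragment are really func_a.loop and func_b.loop and don't collide:)

    func_a:
    .loop:  dec edi
            jnz .loop        ; branches to func_a.loop
            ret
    func_b:
    .loop:  dec esi
            jnz .loop        ; branches to func_b.loop
            ret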
DEFAULT rel
section .text
ALIGN 16
global check_palindrome
; AMD64 SysV calling convention
check_palindrome: ; (size_t len /*rdi*/, const char *str /*rsi*/)
;;returns bool
cmp rdi, 8
jb check_palindrome_byte_at_a_time ; tailcall the version that handles small inputs
add rdi, rsi ; rdi = end pointer
;ALIGN 16 ; probably not worth it, depending on the CPU
.palin_loop:
lodsq ; rax = [rsi], rsi+=8. lodsd/q is only 2 uops on Haswell
bswap rax
sub rdi, 8
cmp rax, [rdi]
jne .not_palin
cmp rsi, rdi
jb .palin_loop ; stop looping when the pointers cross
;; Loop has 7 uops on Haswell, so it can sustain one iteration (8 bytes in each direction) per 2 cycles.
;; 9 uops on SnB, where lodsq is 3 uops and only one of the cmp/jcc pairs can macro-fuse. For SnB, use mov / add instead of lodsq
;; with unrolling, we might get this down closer to 2x8B per clock, instead of per 2 clocks.
;; or with SSE/AVX, 2x16B per clock. Probably not 2x32B per clock with AVX, due to needing two shuffles. (no cross-lane byte shuffle until AVX512)
; input was a palindrome
mov eax, 1 ; return true
ret
.not_palin:
xor eax,eax ; return false
ret
ALIGN 16
;; helper function with the same signature as the main version
; only needed for small strings; the overlapping loads in the main version already handle unaligned or not-multiple-of-8 lengths
; assume that rdi < 2^32 so we can use edi interchangeably, for smaller code-size.
; If our prototype was (unsigned int, const char*), we'd have to mov ecx, edi or something.
; (moving to a different reg is preferable to mov edi,edi because mov-elimination never works on mov same,same)
check_palindrome_byte_at_a_time:
cmp edi, 1
jbe .is_palin ; the empty string is a palindrome, as is length 1
;ALIGN 16
.palin_loop: ; do{
movzx eax, byte [rsi + rdi - 1] ; 2-register addressing modes can't stay micro-fused on SnB-family, but this is a pure load with nothing to fuse with anyway. MOVZX avoids a false dependency on the old RAX: mov al, [mem] would have to merge into RAX on CPUs that don't rename AL separately from EAX
cmp al, [rsi]
jne check_palindrome.not_palin
inc rsi ; pointer moves forward
sub edi, 2 ; index counts down towards zero twice as fast
ja .palin_loop ; }while((idx-=2) > 0U); Ideally would leave the loop on EDI==1 as well, as any byte equals itself
;;; Haswell and later can fuse both branches even when they hit the decoders in the same cycle
;;; so this loop should hopefully be 4 fused-domain uops and run at one iteration per clock
.is_palin:
mov eax, 1 ; return true
ret
; .not_palin: ; shared code with other check_palindrome version
global main
extern strlen
ALIGN 16
main:
;; return !check_palindrome(strlen(argv[1]), argv[1])
;mov edi, inputstr_len
;mov esi, inputstr
mov rdi, [rsi+8] ; argv[1]
push rdi ; save it, and align the stack
call strlen
mov rdi, rax
mov rsi, [rsp] ; could pop here and take advantage of the fact that we know check_palindrome doesn't care about the stack
call check_palindrome
pop rdi ; shorter than add rsp,8 and faster on Intel CPUs with a stack engine, given the surrounding instructions
xor eax, 1 ; we know check_palin returns a _Bool, so flip it
ret
;;; Unused, but improved in case you do want to use it
section .rodata
inputstr db "mommom"
inputstr_len equ $-inputstr
msg_palin db "is palindrome",10,0
msg_nopalin db "is not palindrome" ; newline and terminating zero are on the next line, with their own label
msg_newline db 10,0
(Inspiration for how to write some of main() from gcc output on Godbolt. Compiler output is often a good starting point.)
Tested and works:
$ yasm -Worphan-labels -gdwarf2 -felf64 palindrome.asm && gcc palindrome.o -o palindrome
$ ./palindrome '' && echo "true" || echo "false"
true
$ ./palindrome 1 && echo "true" || echo false
true
$ ./palindrome abccba && echo "true" || echo "false"
true # even size: pair of middle chars
$ ./palindrome abcdcba && echo "true" || echo "false"
true # odd size: single middle char
$ ./palindrome abcdeba && echo "true" || echo "false"
false
$ ./palindrome 'ab bcdcb 1234 bcdcb baab bcdcb 4321 bcdcb ba' && echo "true" || echo "false"
true
$ ./palindrome 'ab bcdcb 1234 bcdcb bab bcdcb 4321 bcdcb ba' && echo "true" || echo "false"
true
$ ./palindrome 'ab bcdcb 1234 bcdcb baab bcdcb 4321 bcdcb baa' && echo "true" || echo "false"
false
$ ./palindrome && echo "true" || echo "false"
Segmentation fault (core dumped) # main doesn't check argc
Case-insensitive:
As well as byte-reversing one chunk, force the alphabetic characters of both chunks to lower case before comparing. With ASCII, this is easy. That answer I linked even has an SSE vector implementation of case-flipping, which can be modified to just force all alphabetic characters to the same case (either replace the final pxor with por, or use a blend with the compare result as the control, instead of pandn/por).
Unrolling:
As well as using displacements ([rsi+8], [rdi-8], etc.) to save add/sub instructions, you can reduce the number of branches by ORing or ANDing some compare results together. This may or may not save anything. On a CPU that doesn't suffer from partial-register merging slowdowns at all (e.g. AMD), this might be a win:
.palin_loop:
xor eax,eax ; inside the loop to break the false dependency
xor edx,edx
movbe ... / cmp ... ; movbe is a load and endian swap in one insn. On Haswell it just saves code-size, not uops. On Silvermont it's a win
setne dl
movbe rcx, [rsi+rdi-16] / cmp [rsi+8], rcx
setne dh
shl edx, 16 ; DH merging takes a cycle on Haswell
... repeat setting dl/dh so edx holds 4 results from four 8B checks
movbe ... / cmp ...
setne al
movbe ... / cmp ...
setne ah
shl eax, 16
... repeat setting al/ah
or eax, edx ; or unroll twice as much, and use rax,rdx
jnz .not_palin
Perhaps a better strategy would be to or some xor results together, like:
movbe rdx, [rsi+rdi-8]
xor rdx, [rsi] ; all-zero if they match, otherwise not
movbe rcx, [rsi+rdi-16]
xor rcx, [rsi+8]
or rcx, rdx ; ZF=0 if there are any mismatches in either XOR result
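Wired into the same structure as the main loop above (rsi walking forward, rdi as the end pointer), that could look something like the following untested sketch. It assumes the small-input cutoff is raised so this only runs with length >= 16, and movbe needs Haswell or Atom (use mov + bswap otherwise):

    .palin_loop:                      ; do {
        movbe   rdx, [rdi-8]          ; last 8 unchecked bytes, byte-reversed
        xor     rdx, [rsi]            ; all-zero iff they match the first 8
        movbe   rcx, [rdi-16]
        xor     rcx, [rsi+8]
        or      rcx, rdx              ; ZF stays set only if both 8-byte pairs matched
        jnz     .not_palin
        add     rsi, 16
        sub     rdi, 16
        cmp     rsi, rdi
        jb      .palin_loop           ; } while the pointers haven't crossed

Overlap past the middle is harmless for the same reason as in the 8-byte-at-a-time version: the extra bytes just re-check pairs that were already verified.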
With SSE, where compare results are already boolean 0/-1 elements, you can pand them together to check that every element matched in both pcmpeqb results. Then one pmovmskb / cmp eax, 0xffff at the end can check multiple vectors of compare results. (See glibc's memcmp and strlen unrolled loops for this and other tricks.)
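For example, one unrolled iteration of that SSE idea might look like this (untested sketch; assumes xmm5 holds the 0..15 byte-reverse shuffle mask from the AVX version below, rsi = start of the unchecked region, rdi = its length, with at least 32 bytes remaining):

        movdqu   xmm0, [rsi + rdi - 16]   ; last 16 unchecked bytes
        movdqu   xmm1, [rsi + rdi - 32]   ; the 16 before those
        pshufb   xmm0, xmm5               ; byte-reverse both (SSSE3)
        pshufb   xmm1, xmm5
        movdqu   xmm2, [rsi]
        movdqu   xmm3, [rsi + 16]
        pcmpeqb  xmm0, xmm2               ; 0xFF in every byte position that matched
        pcmpeqb  xmm1, xmm3
        pand     xmm0, xmm1               ; a byte stays 0xFF only if it matched in both vectors
        pmovmskb eax, xmm0
        cmp      eax, 0xffff              ; all 16 bits set -> all 32 bytes matched
        jne      .not_palin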
AVX:
It looks like the loop structure from the byte-at-a-time version is ideal, and we use pshufb to reverse the order of a whole vector. pmovmskb and testing the bitmask is a pretty standard idiom for dealing with vector-compare results. It's faster than PTEST, because PTEST is 2 uops and can't macro-fuse.
DEFAULT rel
section .rodata
ALIGN 16
bytereverse_shuffle db 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
section .text
ALIGN 16
global check_palindrome_AVX1
check_palindrome_AVX1:
cmp rdi, 32
jb check_palindrome_byte_at_a_time ; tailcall the version that handles small inputs
vmovdqa xmm5, [bytereverse_shuffle]
;;ALIGN 16 ; NASM syntax doesn't have an equivalent for the GNU assembler's align-if-padding-will-be-less-than-N-bytes .p2align 4,,8 (4 meaning 2^4 for power-of-2 align)
;; loop alignment is less important for tiny loops that should fit in the loop buffer
.palin_loop:
vmovdqu xmm0, [rsi + rdi - 16]
vpshufb xmm0, xmm0, xmm5 ; bswap
;;; for AVX2 ymm: vperm2i128 ymm0, ymm0, ymm0, 1 ; swap hi and lo lanes
vpcmpeqb xmm0, xmm0, [rsi] ; all we're gaining from AVX is folding this potentially-unaligned memory operand. Everything else works with SSE
vpmovmskb eax, xmm0
cmp eax, 0xffff ; check that all 16bits are set -> all 16bytes matched
jne .not_palin
add rsi, 16 ; pointer moves forward
sub rdi, 32 ; index counts down towards zero twice as fast
jae .palin_loop ; not ja: we still need to check 16 reversed bytes against the same 16 bytes if we exactly meet in the middle
;;; fused-domain uops: 7 for Haswell, 8 for SnB, so it should run one iteration per 2c
;;; An SSSE3 version that needs an unaligned load separate from pcmpeq will be 8 uops (or 9 on pre-Haswell)
.is_palin:
mov eax, 1
ret
.not_palin:
xor eax,eax
ret
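For reference, a hedged sketch of the AVX2 ymm loop body hinted at in the comment above: vpshufb only shuffles within each 128-bit lane, so pair it with the vperm2i128 lane swap to get a full 32-byte reverse. It assumes ymm5 holds the same 16-byte shuffle mask in both lanes (e.g. loaded with vbroadcasti128), the small-input cutoff is raised to 64, and the labels are as in the function above:

    .palin_loop:
        vmovdqu     ymm0, [rsi + rdi - 32]   ; last 32 unchecked bytes
        vpshufb     ymm0, ymm0, ymm5         ; reverse bytes within each 128-bit lane
        vperm2i128  ymm0, ymm0, ymm0, 1      ; swap the two lanes -> full 32-byte reverse
        vpcmpeqb    ymm0, ymm0, [rsi]
        vpmovmskb   eax, ymm0
        cmp         eax, -1                  ; all 32 bits set -> all 32 bytes matched
        jne         .not_palin
        add         rsi, 32
        sub         rdi, 64
        jae         .palin_loop              ; same jae-not-ja reasoning as the xmm version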