Detecting whether a character is between 'a' and 'z' only takes one sub and one cmp to do the range-check. (See What is the idea behind ^= 32, that converts lowercase letters to upper and vice versa? for details.)
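As a rough C sketch of that range-check (the helper name is_ascii_lower is made up for illustration, and ASCII is assumed): subtracting 'a' maps the lower-case letters onto 0..25, and anything below 'a' wraps around to a huge unsigned value, so one unsigned compare rejects both sides.

```c
#include <stdbool.h>

/* Hypothetical helper, assuming ASCII: true only for 'a'..'z'. */
static bool is_ascii_lower(unsigned char c)
{
    /* The "sub" step: values below 'a' wrap to very large unsigned numbers. */
    unsigned int offset = (unsigned int)c - 'a';

    /* The "cmp" step: one unsigned lower-or-same test covers both bounds. */
    return offset <= (unsigned int)('z' - 'a');
}
```

Called on '`' (one below 'a') or '{' (one above 'z') this returns false; on 'a' and 'z' it returns true.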
We can leave all characters unmodified except the ones that were originally lower-case letters. In ARM mode, we can easily predicate the store (it acts as a NOP if the condition is false). Assuming CPUs handle this efficiently, it won't dirty the cache for strings with no lowercase characters. (@artless noise's answer does this, too, by jumping back to the top of the loop before reaching the store.)
.syntax unified
@ call this with address of string in R0
upperString_ARM_mode:
b .Lentry @ start in the middle of the loop. Or put upperString: there instead of here.
.Lloop: @ do {
sub r2, r1, #'a'
bic r1, #0x20 @ clear the lower-case bit in the original
cmp r2, #'z'-'a' @ set flags
it ls @ For Thumb2 compat; assembles to nothing in ARM mode
strbls r1, [r0, #-1] @ strb with LS predicate (Lower-or-Same unsigned <=)
@ store upcased version if (c-'a') <=(unsigned) length of alphabet
.Lentry:
ldrb r1, [r0],#1 @ zero-extending byte load (with post-increment addressing)
tst r1, r1
bne .Lloop @ }while( *p != 0 )
bx lr @ return. (R0 points one past the terminating 0 byte, because of the post-increment load)
@@@ UNTESTED, except for checking that it assembles for both ARM and Thumb-2
@@@ Doesn't work for Thumb-1
Instead of starting with a b .Lentry, you can just put the upperString label in the middle of the loop, so calls to it made with bl upperString start in the middle of the loop. (Usually function labels are at the top of a function, but if not, any tools that assume that will see the preceding code as part of a different function.)
Re-arranging the loop to have the conditional branch at the bottom (and no unconditional branch) is called "loop rotation" optimization; that's why we have to start in the middle.
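For readers who find C easier to follow, here is a hedged sketch of the same rotated loop shape. The name upper_string and the goto-based entry are my own illustration, not something a compiler or the original code is guaranteed to produce; it assumes ASCII and a writable, 0-terminated string.

```c
/* Hypothetical C rendering of the rotated loop; assumes ASCII input. */
char *upper_string(char *p)
{
    unsigned char c;

    goto entry;                      /* like "b .Lentry": start in the middle */
    do {
        /* Store only if c was a lower-case letter (the predicated strb).
           p was already post-incremented, hence the p[-1]. */
        if ((unsigned int)c - 'a' <= (unsigned int)('z' - 'a'))
            p[-1] = (char)(c & ~0x20);   /* clear the lower-case bit */
entry:
        c = (unsigned char)*p++;     /* byte load with post-increment */
    } while (c != 0);                /* conditional branch at the bottom */

    return p;                        /* one past the terminating 0 byte */
}
```

Calling it on a writable buffer holding "Hello, World!" leaves "HELLO, WORLD!" in the buffer. The load and the exit test sit at the bottom, so the only branch inside the loop is the backward conditional one.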
Unfortunately Thumb mode cbnz can only jump forwards, so you can't use it as the loop branch.
This version of the function has fewer instructions in the loop than @artless noise's (7 vs. 10), but they all run every time. This is good for branch prediction, but may be worse on simple low-end CPUs that don't depend much on that.
This assembles in ARM or Thumb-2 (e.g. with arm-none-eabi-gcc -c -mcpu=cortex-m3), but won't work on CPUs with only Thumb-1 (e.g. cortex-m0).
sub with a destination register different from the source, and an immediate constant that large, doesn't fit in a single narrow 16-bit instruction, neither subs nor sub. Neither does strb with an [r0, #-1] addressing mode.
sub/cmp gets the job done in 2 instructions. For some conditions, you can use cmp / cmpXX (with some predicate) to leave the flags set in some useful way. But here, cmp r1, #'a' / cmphs r1, #'z' would make the LS condition true even if r1<'a'. So one of the instructions has to be rsbs to reverse-subtract, or you need one of the constants in a register so you can do cmp r1, #'a' / cmphs r2, r1 to get consistent flag-conditions without modifying any registers.
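To make the flag argument concrete, here is a small C model of just the C and Z flags that the HS and LS conditions read. It is only a sketch: the struct and function names are invented for illustration, and it models nothing beyond what cmp does to those two flags (C set means no borrow, i.e. first operand >= second, unsigned).

```c
#include <stdbool.h>

/* Only the two flags that HS and LS look at. */
struct flags { bool c, z; };

/* cmp a, b computes a - b: C = no borrow (a >= b unsigned), Z = equal. */
static struct flags cmp(unsigned int a, unsigned int b)
{
    struct flags f = { a >= b, a == b };
    return f;
}

static bool cond_hs(struct flags f) { return f.c; }          /* higher or same */
static bool cond_ls(struct flags f) { return !f.c || f.z; }  /* lower or same  */

/* cmp r1, #'a' ; cmphs r1, #'z' ; then test LS.  The broken pairing. */
static bool ls_after_cmp_cmphs(unsigned int c)
{
    struct flags f = cmp(c, 'a');
    if (cond_hs(f))
        f = cmp(c, 'z');      /* skipped when c < 'a', leaving C clear */
    return cond_ls(f);        /* C clear already satisfies LS */
}

/* cmp r1, #'a' ; cmphs r2, r1 with r2 = 'z' ; then test HS. */
static bool hs_after_cmp_cmphs_reversed(unsigned int c)
{
    struct flags f = cmp(c, 'a');
    if (cond_hs(f))
        f = cmp('z', c);      /* reversed operands: C set iff 'z' >= c */
    return cond_hs(f);
}
```

In this model, ls_after_cmp_cmphs('A') comes out true even though 'A' is not a lower-case letter, while hs_after_cmp_cmphs_reversed is true exactly for 'a'..'z'.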
You could of course do this much faster with NEON SIMD instructions, 8 or 16 bytes at a time, especially if you know the length instead of also having to search for a terminating 0 byte. See Convert a String In C++ To Upper Case for an x86 SSE2 version.