ARM Assembly on Raspberry Pi: Converting a string to upper case

Question

I'm working on a program in which the user enters their name, and the program should convert all lower-case letters to upper case.

I am using the %s format to read the string:

.text
 ldr r0,=msj
 bl printf
 ldr r0,=format
 ldr r1,string
 bl scanf



.data
.align 2
msj: .asciz "Enter you name:  "
format: .asciz "%s"
string: .asciz ""

I have tried subtracting 32 from each character, but I think the strings are not stored as ASCII codes.

Is there any way I can convert the entire word to Upper Case?

You did not reserve any space for `string`. It is a single `'\0'` character. Try `string: .fill 256,1,0` or something sufficient. — artless noise, May 17 '13 at 00:16

artless noise · Answer 1 · 2021-02-21T12:31:55.597

This might work. I don't have any of my ARM material with me at the moment.

@ call with address of string in 'R0'.
upperString:
1: ldrb r1,[r0],#1
   cmp  r1,#0   @ finished string with null terminator?
   bxeq lr      @ then done and return
   cmp  r1,#'a' @ less than a?
   blo  1b      @ then load next char.
   cmp  r1,#'z' @ greater than z?
   bhi  1b      @ then load next char.

   @ Value to upper case.
   sub  r1,r1,#('a' - 'A') @ subtract 32.
   strb r1,[r0,#-1] @ put it back to memory.
   b    1b      @ next character.

At least it is a good starting point. This is like wallyk's code, except I have assumed a null-terminated string instead of a Pascal-type string. To call it:

   ldr r0,=string
   bl  upperString

Variants

The above is for a C-style, NUL-terminated (zero byte) ASCII string, as produced by the .asciz pseudo-op. Another string format is the Pascal type. A Pascal string is figuratively int size; char data[size], with no null terminator. The loop mechanics will be different for a Pascal string, but the core (xor 0x20 or sub 'a' - 'A') is the same for ASCII encodings.
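
A rough C sketch of the counted-string variant, assuming a made-up layout (struct pstring with a size_t length and a fixed 256-byte buffer; real Pascal strings often use a single length byte). Only the loop control differs from the NUL-terminated case.

```c
#include <stddef.h>

/* Hypothetical counted-string layout: a length followed by the bytes.
   There is no terminator; size says how many bytes are valid. */
struct pstring {
    size_t size;
    char data[256];
};

void upper_pstring(struct pstring *p)
{
    for (size_t i = 0; i < p->size; ++i) {   /* loop on the count, not on a zero byte */
        unsigned char c = (unsigned char)p->data[i];
        if (c >= 'a' && c <= 'z') {
            p->data[i] = (char)(c ^ 0x20);   /* same core operation as before */
        }
    }
}
```

Because the loop is driven by the count, a zero byte inside the data does not end the conversion early.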

Some string encodings are different. For fixed-width encodings, the constants will change. Some encodings use escape mechanisms, so each 'glyph' or letter is represented by a different amount of data; the code that steps through the string changes in this case.

Finally, with a 'C' library you often want to know: is this a number, is this punctuation, etc. In these cases, a table holding the property for each character can be indexed by the character code. You might also use this table approach if the encoding for 'Upper' and 'Lower' case is not a contiguous range.
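
The table idea might look like the following in C. This is a minimal sketch for single-byte encodings; upper_table, init_upper_table and upper_string_table are hypothetical names, and the table here is filled for ASCII only.

```c
#include <limits.h>

/* One entry per possible byte value; entry i holds the upper-case form of i. */
static unsigned char upper_table[UCHAR_MAX + 1];

/* Fill the table once: identity for every byte, then override 'a'..'z'.
   For an encoding with non-contiguous letters, only this function changes. */
void init_upper_table(void)
{
    for (int i = 0; i <= UCHAR_MAX; ++i)
        upper_table[i] = (unsigned char)i;
    for (int c = 'a'; c <= 'z'; ++c)
        upper_table[c] = (unsigned char)(c - ('a' - 'A'));
}

/* Convert a NUL-terminated string by looking every byte up in the table. */
void upper_string_table(char *s)
{
    for (; *s != '\0'; ++s)
        *s = (char)upper_table[(unsigned char)*s];
}
```

Note that this version stores every byte back, changed or not, which trades the compare-and-branch for an unconditional load and store.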

Hopefully the variants section is useful for non-'cut and paste' programmers.

score 1 · Answer 2 · answered Feb 20 '21 at 05:07

Detecting whether a character is between 'a' and 'z' only takes one sub and one cmp to do the range-check. (See What is the idea behind ^= 32, that converts lowercase letters to upper and vice versa? for details.)
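
The same range check can be sketched in C with unsigned arithmetic. The function names are hypothetical; the point is that subtracting 'a' makes anything below 'a' wrap around to a large unsigned value, so a single unsigned compare covers both ends of the range.

```c
/* Returns nonzero if c is an ASCII lower-case letter.
   Values below 'a' wrap to large unsigned numbers and fail the compare. */
int is_ascii_lower(unsigned char c)
{
    unsigned int d = (unsigned int)c - 'a';    /* the "sub" */
    return d <= (unsigned int)('z' - 'a');     /* the "cmp" (unsigned <=) */
}

/* Upper-case a NUL-terminated ASCII string using the check above. */
void upper_string_rangecheck(char *s)
{
    for (; *s != '\0'; ++s) {
        unsigned char c = (unsigned char)*s;
        if (is_ascii_lower(c))
            *s = (char)(c & ~0x20u);           /* clear the lower-case bit */
    }
}
```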

We can leave all characters unmodified except ones that were originally lower-case letters. In ARM mode, we can easily predicate the store (acts as a NOP if the condition is false). Assuming CPUs handle this efficiently, it won't dirty the cache for strings with no lowercase characters. (@artless noise's answer does this, too, by jumping back to the top of the loop before reaching the store.)

.syntax unified
@ call this with address of string in R0
upperString_ARM_mode:
   b     .Lentry          @ start in the middle of the loop.  Or put upperString: there instead of here.
.Lloop:                   @ do {
   sub    r2, r1, #'a'
   bic    r1, #0x20         @ clear the lower-case bit in the original
   cmp    r2, #'z'-'a'      @ set flags

   it  ls                   @ For Thumb2 compat; assembles to nothing in ARM mode 
   strbls   r1, [r0, #-1]   @ strb with LS predicate (Lower-or-Same unsigned <=)
                            @ store upcased version if (c-'a') <=(unsigned) length of alphabet
 .Lentry:
   ldrb   r1, [r0],#1       @ zero-extending byte load (with post-increment addressing)
   tst    r1, r1
   bne    .Lloop          @ }while( *p != 0 ) 

   bx    lr          @ return.  (R0 pointing at terminating 0 byte)

@@@ UNTESTED, except for checking that it assembles for both ARM and Thumb-2
@@@ Doesn't work for Thumb-1

Instead of starting with a b .Lentry, you can just put the upperString label in the middle of the loop, so calls to it made with bl upperString start in the middle of the loop. (Usually function labels are at the top of a function; if this one is not, any tools that assume so will see the preceding code as part of a different function.)

Re-arranging the loop to have the conditional branch at the bottom (and no unconditional branch) is called "loop rotation" optimization; that's why we have to start in the middle.
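
As a loose C analogue of the rotated loop (a sketch only, with the hypothetical name upper_string_rotated), a goto into the middle of a do-while plays the role of b .Lentry, and the loop condition sits at the bottom:

```c
/* Rotated loop: enter at the load, test at the bottom.
   Mirrors the structure of the assembly, not necessarily its performance. */
void upper_string_rotated(char *p)
{
    unsigned char c;
    goto entry;                               /* like `b .Lentry` */
    do {
        /* c was loaded from p[-1]; store only if it was a lower-case letter */
        if ((unsigned int)c - 'a' <= (unsigned int)('z' - 'a'))
            p[-1] = (char)(c & ~0x20u);
entry:
        c = (unsigned char)*p++;              /* load with post-increment */
    } while (c != 0);                         /* conditional branch at the bottom */
}
```

A C compiler will normally perform this rotation on its own for a plain while loop, so writing it this way by hand is mainly useful for seeing how the assembly is laid out.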

Unfortunately Thumb mode cbnz can only jump forwards, so you can't use it as the loop branch.

This version of the function has fewer instructions in the loop than @artless noise's (7 vs. 10), but they all run every time. This is good for branch prediction, but may be worse on simple low-end CPUs that don't rely much on branch prediction.

This assembles in ARM or Thumb-2 (e.g. with arm-none-eabi-gcc -c -mcpu=cortex-m3), but won't work on CPUs with only Thumb-1. (e.g. cortex-m0).

sub with a destination register different from the source, and an immediate constant that large, doesn't fit in a single narrow 16-bit instruction, neither subs nor sub. Neither does strb with an [r0, #-1] addressing mode.

sub/cmp gets the job done in 2 instructions. For some conditions, you can use cmp / cmpXX (with some predicate) to leave the flags set in some useful way. But here, cmp r1, #'a' / cmphs r1, #'z' would make the LS condition true even if r1<'a'. So one of the instructions has to be rsbs to reverse-subtract, or you need one of the constants in a register so you can do cmp r1, #'a' / cmphs r2, r1 to get consistent flag-conditions without modifying any registers.

You could of course do this much faster with NEON SIMD instructions, 8 or 16 bytes at a time, especially if you know the length instead of also having to search for a terminating 0 byte. See Convert a String In C++ To Upper Case for an x86 SSE2 version.

score 0 · Answer 3 · answered May 16 '13 at 23:55


This is the essential algorithm:

for (int idx = 0;  idx < len;  ++idx)
    if (str [idx] >= 'A'  &&  str [idx] <= 'Z')
         str [idx] += 'a' - 'A';

It has several parts you don't have. Scanning the string character by character. Inspecting for an uppercase letter. Adding (not subtracting) the lowercase/uppercase offset.

Note that this won't work for Unicode in general.

wallyk


Yes, that's exactly what I did, but when I print the string again it won't return the uppercase letter; it returns random characters or just a blank space. str [idx] += 'a' - 'A'; I don't think there are assembly instructions to convert to upper case. Usually I subtract 32 when I have an ASCII character (%c), but with a string (%s) it won't work. – Pablo Estrada May 17 '13 at 00:04
@PabloEstrada: Try adding 32 instead of subtracting. The lowercase letters have higher ASCII codes than the uppercase letters. Also, you didn't show all the related code. – wallyk May 17 '13 at 00:14
