X86 NASM Assembly converting lower to upper and upper to lowercase characters

Question

As I am pretty new to assembly, I have a few questions about how to convert a lowercase letter to uppercase, or vice versa, depending on what the user enters. Here is what I have so far:

section .data
Enter db "Enter: "
Enter_Len equ $-Enter

Output db "Output: "
Output_Len equ $-Output

Thanks db "Thanks!"
Thanks_Len equ $-Thanks

Loop_Iter dd 0 ; Loop counter

section .bss
In_Buffer resb 2
In_Buffer_Len equ $-In_Buffer

section .text
global _start

_start:
    ; Print Enter message
    mov eax, 4 ; sys_write
    mov ebx, 1
    mov ecx, Enter
    mov edx, Enter_Len
    int 80h

    ; Read input
    mov eax, 3 ; sys_read
    mov ebx, 0
    mov ecx, In_Buffer
    mov edx, In_Buffer_Len
    int 80h

So basically, if I am correct, edx now contains the string entered. Now comes the dilemma of converting from lowercase to uppercase and from uppercase to lowercase. As I am absolutely new to this, I have literally no clue what to do. Any help would be much appreciated :)

Please note that I edited your post to fix the code formatting. Next time, please use spaces rather than tabs. – Jim Mischel, Feb 27 '14 at 03:24
@JimMischel - Sorry about that brother, didn't realize it was messed up, my apologies. And thanks for the edit! :) — user2128074, Feb 27 '14 at 03:31
Related: [What is the idea behind ^= 32, that converts lowercase letters to upper and vice versa?](https://stackoverflow.com/a/54585515), and also [How to access a char array and change lower case letters to upper case, and vice versa](https://stackoverflow.com/a/35936844) for an efficient asm loop. (including a SIMD version.) — Peter Cordes, Jun 05 '20 at 14:13

Alexis Wilke · Answer 1 · 2016-04-20T22:23:28.233

6

If you only support ASCII, then you can force a letter to lowercase by ORing it with 0x20:

  or   eax, 0x20

Similarly, you can transform a letter to uppercase by clearing that bit:

  and  eax, ~0x20   ; clears bit 5 (mask 0xFFFFFFDF); 0xBF would clear bit 6 instead

And as nneonneo mentioned, the character case can be swapped using the XOR instruction:

  xor  eax, 0x20

That only works if eax is between 'a' and 'z' or 'A' and 'Z', so you'd have to compare and make sure you are in the range:

  cmp  eax, 'a'
  jl   .not_lowercase
  cmp  eax, 'z'
  jg   .not_lowercase
  and  eax, ~0x20       ; lowercase letter: clear bit 5 to get uppercase
.not_lowercase:

I used nasm syntax. You may want to make sure the jl and jg are correct too...

If you need to transform any international character, then that's a lot more complicated unless you can call a library function that handles characters beyond ASCII, such as libc's towlower() or towupper(), which take wide characters.

As a fair question: why would it work? (asked by kuhaku)

ASCII characters (also ISO-8859-1) have the basic uppercase characters defined between 0x41 and 0x5A and the lowercase characters between 0x61 and 0x7A.

In hexadecimal, the only difference between the two ranges is the high digit: 4 becomes 6 and 5 becomes 7. To get that, you force bit 5 (0x20) to be set.

To go to uppercase, you do the opposite: you clear bit 5 so it becomes zero.
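
To make the bit arithmetic concrete, here is a small sketch (not part of the original answer) that walks three sample letters through the three instructions. It assumes 32-bit Linux and the int 0x80 ABI used in the question; the file name and the sample letters are chosen only for illustration.

```nasm
; bits.asm - illustration only, produces no output
; build: nasm -f elf32 bits.asm && ld -m elf_i386 -o bits bits.o
section .text
global _start

_start:
    ; 'A' = 0x41 = 0100 0001b, 'a' = 0x61 = 0110 0001b: they differ only in bit 5
    mov  al, 'A'        ; al = 0x41
    or   al, 0x20       ; set bit 5    -> al = 0x61 ('a')

    mov  al, 'z'        ; al = 0x7A = 0111 1010b
    and  al, ~0x20      ; clear bit 5  -> al = 0x5A ('Z')

    mov  al, 'q'        ; al = 0x71
    xor  al, 0x20       ; toggle bit 5 -> al = 0x51 ('Q')
    xor  al, 0x20       ; toggle again -> al = 0x71 ('q')

    mov  eax, 1         ; sys_exit
    xor  ebx, ebx       ; status 0
    int  0x80
```

Stepping through this in a debugger and watching al after each instruction shows the same values as the comments.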

edited Apr 20 '16 at 22:23

answered Feb 27 '14 at 04:17

Alexis Wilke


You have a broken mix of NASM and AT&T syntax. You're missing the `$` on the immediate operands. The OP's code is in Intel/NASM syntax anyway, so that would have been sensible. Anyway, you can write `and eax, ~0x20` to make the code clearer. – Peter Cordes Dec 23 '15 at 04:15
Okay, I tested with nasm to make sure and made corrections. I did not try to run the code so I do not know whether the `jl` and `jg` are really correct. For the `0xBF`, we often use `0x5F` too... which is not `~0x20`. – Alexis Wilke Dec 23 '15 at 04:39
Just noticed: `~0x20` fits in a sign-extended imm8, but 0xBF will produce `and r32, imm32` (because it's outside the [-128 .. 127] range). So `~0x20` is more efficient, like `0xFFFFFFBF`. Masking down to 7-bit ASCII with `0x5F` will also be efficient. I think your branches look right. I might have named the label `.not_lowercase`, to avoid confusion with `jl`. – Peter Cordes Dec 23 '15 at 04:40
Why would it work with those specific hex values? How did you get them? – shinzou Apr 20 '16 at 22:18
@kuhaku, I updated my answer with information about why it works. A.Danischewski below shows another way to do it by using ADD and SUB. Only in his case he needs to know whether it is a lower or upper case letter beforehand. If you know you only have letters, the OR and AND work without the need to know beforehand. – Alexis Wilke Apr 20 '16 at 22:26
Oh I see, very cool although I think you got it wrong with the bit, since 0x20 = 0b100000 then it's the sixth bit and not the fifth. Thanks for the explanation. – shinzou Apr 20 '16 at 22:52

@kuhaku Well... I say "bit 5". You generally count bits starting at zero... it's always a bit of a gray area, I know... So "bit 5" is the 6th bit. 8-) https://en.wikipedia.org/wiki/Bit_Test – Alexis Wilke Apr 20 '16 at 23:08

score 3 · Answer 2 · edited Jan 22 '15 at 18:11

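Okay, but your string is not in edx (edx only holds the maximum length you asked for); it's in memory at [ecx] (or [In_Buffer]). With a two-byte buffer, only one character of it is useful. After sys_read returns, eax holds the number of bytes actually read. To get a single character...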

mov al, [ecx]

In a HLL you do "if some condition, execute this code". You might wonder how the CPU knows whether to execute the code or not. What we really do (HLLs do this for you) is "if NOT condition, skip over this code" (to a label). Experiment with it, you'll figure it out.

Exit cleanly, whatever path your code takes. You don't show this, but I assume you do it.

I just posted some info on sys_read here.

It's for a completely different program (adding two numbers - "hex" numbers) but the part about sys_read might interest you...
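
Putting the points above together, a minimal sketch of a complete program might look like this. It is an illustration under stated assumptions rather than the answerer's code: it targets 32-bit Linux with the int 0x80 ABI from the question, and the names buf and BUF_LEN are hypothetical. It reads the characters from memory, uses the "if NOT condition, skip over this code" pattern for each test, and exits cleanly on every path.

```nasm
; flip.asm - read a line, swap the case of each ASCII letter, print it
; build: nasm -f elf32 flip.asm && ld -m elf_i386 -o flip flip.o

BUF_LEN equ 64                  ; hypothetical buffer size

section .bss
buf     resb BUF_LEN

section .text
global _start

_start:
    mov  eax, 3                 ; sys_read
    mov  ebx, 0                 ; fd 0 = stdin
    mov  ecx, buf
    mov  edx, BUF_LEN
    int  0x80                   ; eax = bytes read (0 = EOF, negative = error)

    cmp  eax, 0
    jle  .exit                  ; nothing read: skip the conversion and the write

    mov  esi, buf               ; esi = address of the current character
    mov  edi, eax               ; edi = number of bytes read (kept for sys_write)
    mov  ecx, eax               ; ecx = characters left to process

.next_char:
    mov  al, [esi]              ; load one character from memory
    cmp  al, 'A'
    jb   .store                 ; below 'A': not a letter, skip the flip
    cmp  al, 'Z'
    jbe  .flip                  ; 'A'..'Z'
    cmp  al, 'a'
    jb   .store                 ; between 'Z' and 'a': not a letter
    cmp  al, 'z'
    ja   .store                 ; above 'z': not a letter
.flip:
    xor  al, 0x20               ; toggle bit 5: swap case
.store:
    mov  [esi], al
    inc  esi
    dec  ecx
    jnz  .next_char

    mov  eax, 4                 ; sys_write
    mov  ebx, 1                 ; fd 1 = stdout
    mov  ecx, buf
    mov  edx, edi               ; write only the bytes that were read
    int  0x80

.exit:
    mov  eax, 1                 ; sys_exit
    xor  ebx, ebx               ; status 0
    int  0x80
```

Digits, punctuation, and the trailing newline fall through every test and are written back unchanged.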

score 0 · Answer 3 · answered Feb 27 '14 at 03:05


Cute trick: if they type only letters, you can XOR their input letters with 0x20 to swap their case.

Then, if they can type more than letters, you just have to check each letter to see if it is alphabetical before XORing it. You can do that with a test to see if it lies in the ranges 'a' to 'z' or 'A' to 'Z', for example.
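
One compact way to write that test is sketched below. It is a common idiom rather than something spelled out in this answer, and the routine name flip_case_al and the use of dl as scratch are assumptions made for the example. Instead of four compares, it folds both letter ranges onto 'a'..'z' and does a single unsigned range check.

```nasm
; flip_case_al: hypothetical helper, swaps the case of the ASCII letter in al.
; Non-letters are returned unchanged. Clobbers dl.
flip_case_al:
    mov  dl, al
    or   dl, 0x20               ; fold 'A'..'Z' onto 'a'..'z'
    sub  dl, 'a'                ; letters now map to 0..25; everything else ends up above 25
    cmp  dl, 'z' - 'a'          ; unsigned compare against 25
    ja   .done                  ; not a letter: skip the flip
    xor  al, 0x20               ; toggle bit 5
.done:
    ret
```

The check works on a copy, so the original character in al is only modified once it is known to be a letter.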

Alternately, you can just map each letter through a 256-element table which maps the characters the way you want them (this is usually how functions like toupper are implemented, for example).
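
A minimal sketch of the table approach follows, assuming 32-bit code. The names swapcase_table, byteval, and swap_case_al are made up for the example, and the table is generated at assembly time with NASM's preprocessor instead of being typed out by hand.

```nasm
section .data
; 256-entry table: each entry holds the case-swapped version of its own index.
swapcase_table:
%assign byteval 0
%rep 256
  %if (byteval >= 'A') && (byteval <= 'Z')
    db byteval + 0x20           ; uppercase -> lowercase
  %elif (byteval >= 'a') && (byteval <= 'z')
    db byteval - 0x20           ; lowercase -> uppercase
  %else
    db byteval                  ; everything else maps to itself
  %endif
  %assign byteval byteval+1
%endrep

section .text
; swap_case_al: hypothetical helper, al = input byte, returns the mapped byte in al.
swap_case_al:
    movzx eax, al               ; zero-extend so eax is a valid table index
    mov   al, [swapcase_table + eax]
    ret
```

The lookup itself has no branches, and changing the mapping (for example, to uppercase only) means changing the table, not the code.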


nneonneo


The thing is, the user can input a lowercase letter, an uppercase letter, a number, or a special character. As I have a Java background, I could have just used if statements and been done with it. However, being COMPLETELY unfamiliar with assembly, I have no idea how to even compare the input. – user2128074 Feb 27 '14 at 03:10
Why does it work with `020h` and how did they get to it? – shinzou Apr 20 '16 at 22:21
@shinzou: [What is the idea behind ^= 32, that converts lowercase letters to upper and vice versa?](https://stackoverflow.com/a/54585515) – Peter Cordes Jun 05 '20 at 14:16

score 0 · Answer 4 · 2015-12-19T05:45:06.253

Here is a NASM program I hacked together that flips the case of a string. You basically need to loop over the string, check each character against the ASCII range boundaries, and then add or subtract 0x20 to change the case (that is the distance between upper and lower case in ASCII). You can use the Linux ascii command to see a table of ASCII values.

File: flipcase.asm

section     .text
global      _start                 ; Entry point for linker (ld)

  ; Linker entry point                                
_start:                                                         
    mov     rcx,len                ; Place length of message into rcx
    mov     rbp,msg                ; Place address of our msg into rbp    
    dec     rbp                    ; Adjust count to offset

  ; Go through the buffer and convert lowercase to uppercase characters:
upperScan:
    cmp byte [rbp+rcx],0x41        ; Test input char against uppercase 'A'                 
    jb lowerScan                   ; Not uppercase Ascii < 0x41 ('A') - jump below
    cmp byte [rbp+rcx],0x5A        ; Test input char against uppercase 'Z' 
    ja lowerScan                   ; Not uppercase Ascii > 0x5A ('Z') - jump above  
     ; At this point, we have an uppercase character
    add byte [rbp+rcx],0x20        ; Add 0x20 to get the lowercase Ascii value
    jmp Next                       ; Done, jump to next

lowerScan:
    cmp byte [rbp+rcx],0x61        ; Test input char against lowercase                 
    jb Next                        ; Not lowercase Ascii < 0x61 ('a') - jump below
    cmp byte [rbp+rcx],0x7A        ; Test input char against lowercase 'z'
    ja Next                        ; Not lowercase Ascii > 0x7A ('z') - jump above
     ; At this point, we have a lowercase char
    sub byte [rbp+rcx],0x20        ; Subtract 0x20 to get the uppercase Ascii value
     ; Fall through to next        

Next:   
    dec rcx                        ; Decrement counter
    jnz upperScan                  ; If characters remain, loop back

  ; Write the buffer full of processed text to stdout:
Write:        
    mov     rbx,1                  ; File descriptor 1 (stdout)    
    mov     rax,4                  ; System call number (sys_write)
    mov     rcx,msg                ; Message to write        
    mov     rdx,len                ; Length of message to write
    int     0x80                   ; Call kernel interrupt
    mov     rax,1                  ; System call number (sys_exit)
    xor     ebx,ebx                ; Exit status 0 (rbx still held 1 from the write)
    int     0x80                   ; Call kernel

section     .data

msg     db  'hELLO, wwwoRLD!',0xa  ; Our dear string
len     equ $ - msg                ; Length of our dear string

Then you can assemble, link, and run it with:
$> nasm -felf64 flipcase.asm && ld -melf_x86_64 -o flipcase flipcase.o && ./flipcase

Don't use `int 0x80` in 64-bit code: [What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?](https://stackoverflow.com/q/46087730) — Peter Cordes, Jun 05 '20 at 14:11
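
For reference, here is a sketch of what the write and exit steps look like with the native 64-bit syscall instruction. It reuses the msg and len names from the answer, leaves out the case-flipping loop, and uses a hypothetical file name. The call numbers (1 for write, 60 for exit) and the argument registers are those of the x86-64 Linux ABI, which differ from the int 0x80 ones.

```nasm
; write64.asm - 64-bit Linux: syscall instead of int 0x80
; build: nasm -felf64 write64.asm && ld -o write64 write64.o
; Arguments go in rdi, rsi, rdx; the call number goes in rax.
; The syscall instruction itself clobbers rcx and r11.
section .data
msg     db  'hELLO, wwwoRLD!',0xa
len     equ $ - msg

section .text
global  _start
_start:
    mov     eax, 1              ; sys_write is 1 in the x86-64 table
    mov     edi, 1              ; fd 1 = stdout
    lea     rsi, [rel msg]      ; buffer address
    mov     edx, len            ; byte count
    syscall

    mov     eax, 60             ; sys_exit is 60 in the x86-64 table
    xor     edi, edi            ; status 0
    syscall
```

Because syscall overwrites rcx, a loop counter kept in rcx (as in the answer above) would not survive a system call made inside the loop.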

score 0 · Answer 5 · answered Dec 22 '15 at 23:14

Jeff Duntemann wrote a book called Assembly Language Step-by-Step: Programming with Linux, which covers this topic very well on pages 275-277.

There he shows that with the instruction sub byte [ebp+ecx], 20h you can change lowercase to uppercase. Please note that this version uses a 1024-byte buffer, which is a faster and better way to do this than the previous example on pages 268-269, where the buffer holds only 8 bits at a time.

Why is `020h` special? How did they get it? – shinzou Apr 20 '16 at 22:19
