How can I check if a character is a letter in assembly?

Question

I have a block of code that sets the bounds to check whether a character is a letter (not a number or a symbol), but I don't think it handles the characters that sit between the upper-case and lower-case ranges. Can you help? Thanks!

mov al, byte ptr[esi + ecx]; move the first character to al
cmp al, 0                  ; compare al with null, which marks the end of the string
je done                    ; if yes, jump to done
cmp al, 0x41               ; compare al with "A" (lower bound)
jl next_char               ; jump to next character if less
cmp al, 0x7A               ; compare al with "z" (upper bound)
jg next_char               ; jump to next character if greater
; do something if it's a letter
next_char:
; do something different

Have you single-stepped through it in a debugger? See http://stackoverflow.com/tags/x86/info for some stuff on getting started debugging, and other links. Checking this condition in asm is no different than checking it in C. — Peter Cordes, Aug 05 '15 at 06:02
Yes you did, but apparently my comment wasn't clear either. Your code refers to 2 strings, and it looks like you're trying to compare something (the 2 strings?), and also conditionally "jumping" in some places, but the full logic isn't clear. — Amit, Aug 05 '15 at 06:03
FWIW `; move al to the first character` should be `; move first character to al`. — Rudy Velthuis, Aug 05 '15 at 17:30

score 5 · Answer 1 · answered Aug 05 '15 at 06:20

You can OR each character with 0x20; this converts upper-case letters to lower-case (and maps non-letter characters to other non-letter characters):

...
je done       ; This is your existing code
or al, 0x20   ; <-- This line is new!
cmp al, 0x41  ; This is your existing code again
...

Note: If your code needs to handle letters above 0x7F (like "Ä", "Ó", "Ñ"), it becomes much more complex. One problem is that these characters are not ASCII, and their codes depend on the encoding: they differ between Windows console programs (example: "Ä" = 0x8E) and Windows GUI programs ("Ä" = 0xC4), and may be different again on other operating systems.
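
For reference, the OR trick can be sketched in C and checked against the plain two-range test for every byte value. This is only an illustration: the function names are made up here, and the lower-case bounds 'a' and 'z' are used after the OR (as the comments below point out is necessary).

```c
#include <stdbool.h>

/* Reference check: the two ASCII letter ranges, tested separately. */
static bool is_letter_two_ranges(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* OR trick: setting bit 5 (0x20) maps 'A'..'Z' onto 'a'..'z',
   so one lower-case range check covers both cases. */
static bool is_letter_or_trick(unsigned char c)
{
    unsigned char folded = (unsigned char)(c | 0x20);
    return folded >= 'a' && folded <= 'z';
}

/* Count the byte values (0..255) on which the two checks disagree. */
static int count_mismatches(void)
{
    int mismatches = 0;
    for (int c = 0; c < 256; c++) {
        if (is_letter_two_ranges((unsigned char)c) !=
            is_letter_or_trick((unsigned char)c))
            mismatches++;
    }
    return mismatches;
}
```

If the trick is sound, count_mismatches() returns 0, meaning the OR never turns a non-letter into a letter.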

Martin Rosenau

That's a cool trick, but you missed a spot ;-). After the `or` the comparison should be against 0x61. – Amit Aug 05 '15 at 06:27
@Amit: or write it as `'a'`. Alternatively, convert to uppercase with `and al, ~0x20`. Neither one can replace the compare-with-zero as the flag-setting op before the `je done`, unfortunately. The `and` would convert space (`' '`) to 0, and the result of the `or` is never zero. – Peter Cordes Aug 05 '15 at 06:33

Peter Cordes · Answer 2 · 2023-03-03T04:16:06.377

Correct, there's a gap of a few non-alphabetic characters between 'Z' and 'a'.
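
To see exactly which characters a single 'A'..'z' range lets through by mistake, here is a small C sketch that collects them. The helper name gap_characters is hypothetical, and ASCII is assumed.

```c
#include <stddef.h>

/* Collect every ASCII code in 'A'..'z' that is not a letter.
   out must have room for at least 7 bytes (6 characters + NUL).
   Returns the number of characters written (excluding the NUL). */
static size_t gap_characters(char *out)
{
    size_t n = 0;
    for (int c = 'A'; c <= 'z'; c++) {
        int is_upper = (c >= 'A' && c <= 'Z');
        int is_lower = (c >= 'a' && c <= 'z');
        if (!is_upper && !is_lower)
            out[n++] = (char)c;
    }
    out[n] = '\0';
    return n;
}
```

The gap is the six codes 0x5B through 0x60 (the rows after Z and before a in the table at the end of this answer).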

The most efficient way is to set the lower-case bit with an OR, then use the range-check trick of sub + unsigned compare. This of course only works for ASCII, not extended character sets where there are other ranges of alphabetic characters. Note that or al, 0x20 can never create a lower-case character if the original wasn't an upper-case character, because the ranges are "aligned" the same relative to a mod 32 boundary of ASCII codes.

Arrange your loop structure with the conditional branch at the bottom. Either enter the loop with a jmp to that load and test, or peel that part of the first iteration. (Why are loops always compiled into "do...while" style (tail jump)?)

Use movzx loads to avoid a false dependency on merging a low byte into EAX when writing AL.

 ; ESI = pointer to the string
    xor    ecx, ecx            ; index = 0
    movzx  eax, byte ptr[esi]  ; test first character
    test   eax, eax
    jz    .done                ; skip the loop on empty string
 ; alternative: jmp .next_char to enter the loop
.loop:                         ; do{
    inc    ecx

    mov    edx, eax               ; save a copy of the original if needed
;;;; THESE 4 INSTRUCTIONS ARE THE ALPHA / NON-ALPHA TEST
    or     al, 0x20               ; force lowercase
    sub    al, 'a'                ; AL = 0..25 if alphabetic
    cmp    al, 'z'-'a'
    ja    .non_alphabetic         ; unsigned compare rejects too high or too low (wrapping)

;; do something if it's a letter
    jmp   .next_char
.non_alphabetic:
;; do something different, then fall through

.next_char:
    movzx  eax, byte ptr[esi + ecx]
    test   eax, eax
    jnz    .loop                 ; }while((AL = str[i]) != 0);

.done:

If the input is before 'a', sub al, 'a' will be signed negative, or as unsigned will wrap to a high value, so cmp al, 'z'-'a' / ja will reject it.

If the input is after 'z', sub al, 'a' will leave a value higher than 25 ('z'-'a'), so the unsigned compare will reject it also.

Compilers use this unsigned compare trick when compiling a C expression like c <= 'z' && c >= 'a', so you can be sure it works the same as that expression for every possible input.
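
A C sketch of the same four-instruction test, compared exhaustively against the two-range expression. The function names are invented for this example, and uint8_t stands in for AL so the subtraction wraps modulo 256 the way sub al, 'a' does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors: or al,0x20 / sub al,'a' / cmp al,'z'-'a' / ja .non_alphabetic */
static bool is_alpha_range_trick(uint8_t c)
{
    uint8_t offset = (uint8_t)((c | 0x20) - 'a'); /* wraps to a large value if below 'a' */
    return offset <= 'z' - 'a';                   /* unsigned compare: 0..25 means letter */
}

/* Straightforward reference: two separate ASCII ranges. */
static bool is_alpha_reference(uint8_t c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Number of byte values (0..255) where the trick and the reference differ. */
static int range_trick_mismatches(void)
{
    int mismatches = 0;
    for (int c = 0; c < 256; c++) {
        if (is_alpha_range_trick((uint8_t)c) != is_alpha_reference((uint8_t)c))
            mismatches++;
    }
    return mismatches;
}
```

For example, '@' (0x40) becomes 0x60 after the OR, then 0xFF after the subtraction, which is above 25 and therefore rejected.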

Other style notes: normally you'd just increment ESI, instead of having both a pointer and an index. Also, you may not need mov edx, eax if you can use the AL value (0-25 index into the alphabet). Making a copy and using this "destructive" test is usually better than 2 separate branches.
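
As one possible illustration of both notes, this C sketch walks the string with a single pointer and uses the 0..25 result of the test directly as an index, here to count letter frequencies. The function count_letters and the histogram use case are assumptions for the example, not something from the question.

```c
#include <stdint.h>

/* Case-insensitive ASCII letter histogram.
   counts must point to 26 slots that the caller has zeroed;
   counts[0] is for 'a'/'A', counts[25] for 'z'/'Z'. */
static void count_letters(const char *str, unsigned counts[26])
{
    /* One pointer that is incremented, instead of a base pointer plus an index. */
    for (const unsigned char *p = (const unsigned char *)str; *p != 0; p++) {
        uint8_t idx = (uint8_t)((*p | 0x20) - 'a'); /* 0..25 only for letters */
        if (idx <= 'z' - 'a')
            counts[idx]++;                          /* reuse the test result as the index */
    }
}
```

Because the index comes straight out of the range check, no copy of the original character is needed in this case.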

NASM syntax allows character constants like C, so you can write 0x41 as 'A', or 0x7A as 'z'. e.g. cmp al, 'a'. Then you don't even need to comment the line.

Writing the loop this way (with the load and the conditional branch at the bottom, at .next_char) saves a jmp. Fewer instructions in the loop = better. The only point of writing asm these days is performance, so it makes sense to learn good techniques like this from the start, if it's not too confusing. No assembly answer would be complete without a link to http://agner.org/optimize/.

output of ascii(1), or http://www.asciitable.com/

Dec Hex    Dec Hex    Dec Hex  Dec Hex  Dec Hex  Dec Hex   Dec Hex   Dec Hex  
  0 00 NUL  16 10 DLE  32 20    48 30 0  64 40 @  80 50 P   96 60 `  112 70 p
  1 01 SOH  17 11 DC1  33 21 !  49 31 1  65 41 A  81 51 Q   97 61 a  113 71 q
  2 02 STX  18 12 DC2  34 22 "  50 32 2  66 42 B  82 52 R   98 62 b  114 72 r
  3 03 ETX  19 13 DC3  35 23 #  51 33 3  67 43 C  83 53 S   99 63 c  115 73 s
  4 04 EOT  20 14 DC4  36 24 $  52 34 4  68 44 D  84 54 T  100 64 d  116 74 t
  5 05 ENQ  21 15 NAK  37 25 %  53 35 5  69 45 E  85 55 U  101 65 e  117 75 u
  6 06 ACK  22 16 SYN  38 26 &  54 36 6  70 46 F  86 56 V  102 66 f  118 76 v
  7 07 BEL  23 17 ETB  39 27 '  55 37 7  71 47 G  87 57 W  103 67 g  119 77 w
  8 08 BS   24 18 CAN  40 28 (  56 38 8  72 48 H  88 58 X  104 68 h  120 78 x
  9 09 HT   25 19 EM   41 29 )  57 39 9  73 49 I  89 59 Y  105 69 i  121 79 y
 10 0A LF   26 1A SUB  42 2A *  58 3A :  74 4A J  90 5A Z  106 6A j  122 7A z
 11 0B VT   27 1B ESC  43 2B +  59 3B ;  75 4B K  91 5B [  107 6B k  123 7B {
 12 0C FF   28 1C FS   44 2C ,  60 3C <  76 4C L  92 5C \  108 6C l  124 7C |
 13 0D CR   29 1D GS   45 2D -  61 3D =  77 4D M  93 5D ]  109 6D m  125 7D }
 14 0E SO   30 1E RS   46 2E .  62 3E >  78 4E N  94 5E ^  110 6E n  126 7E ~
 15 0F SI   31 1F US   47 2F /  63 3F ?  79 4F O  95 5F _  111 6F o  127 7F DEL

Thanks! I will edit it. So, do you know how I can check for the characters in between "A" and "z"? I tried to set up 4 bounds, but I don't think it works that way. — conchimnon, Aug 05 '15 at 06:10
@conchimnon: A good starting point would be C compiler output for a C function that does what you want. Writing in ASM directly is a lot of work; using a decent starting point before starting to optimize your code in asm can save time. (Or sometimes you see the compiler is doing something inefficient.) See http://agner.org/optimize/ for more suggestions about using asm. — Peter Cordes, Aug 05 '15 at 06:17

Amit · Accepted Answer · 2015-08-05T06:28:41.373

You need logic that combines multiple conditions, similar to this C statement: if((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))

You can do that like this:

...
je done                    ; if yes, jump to done
cmp al, 0x41               ; compare al with "A"
jl next_char               ; jump to next character if less
cmp al, 0x5A               ; compare al with "Z"
jle found_letter           ; if al is >= "A" && <= "Z" -> found a letter
cmp al, 0x61               ; compare al with "a"
jl next_char               ; jump to next character if less (since it's between "Z" & "a")
cmp al, 0x7A               ; compare al with "z"
jg next_char               ; above "z" -> not a letter
found_letter:
; ...
next_char:
; ...

This naive direct translation of C to asm is correct but inefficient. It's possible with 4 instructions, only 1 of them a branch; see my answer. — Peter Cordes, Sep 22 '20 at 21:10

score -1 · Answer 4 · answered Sep 22 '20 at 17:16

This function takes a string, and uses ASCII table values to determine if it is an upper case char or lower case char. The CMP-->BLS and CMP-->BHI instructions are what determine if it's an upper or lower case char. The code that comes afterwards capitalizes the char if it is a lower case char.

__asm void my_capitalize(char *str)
{
cap_loop
        LDRB r1, [r0] ; Load byte into r1 from memory pointed to by r0 (str pointer)
        CMP r1, #'a'-1 ; compare it with the character before 'a'
        BLS cap_skip ; If byte is lower or same, then skip this byte
        CMP r1, #'z' ; Compare it with the 'z' character
        BHI cap_skip ; If it is higher, then skip this byte
        SUBS r1,#32 ; Else subtract out difference to capitalize it
        STRB r1, [r0] ; Store the capitalized byte back in memory
cap_skip
        ADDS r0, r0, #1 ; Increment str pointer
        CMP r1, #0 ; Was the byte 0?
        BNE cap_loop ; If not, repeat the loop
        BX lr ; Else return from subroutine
}
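
For readers who don't know ARM, a rough C equivalent of the routine might look like the sketch below. The name capitalize_ascii is made up; like the assembly, it assumes ASCII and only tests for the lower-case range.

```c
/* Upper-case every 'a'..'z' byte of a NUL-terminated string, in place.
   Bytes outside that range (including upper-case letters) are left unchanged. */
static void capitalize_ascii(char *str)
{
    for (; *str != '\0'; str++) {
        if (*str >= 'a' && *str <= 'z')
            *str = (char)(*str - 32);   /* 'a' - 'A' == 32 in ASCII */
    }
}
```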

This code only checks for lower-case. It doesn't detect upper-case alphabetic characters in the source string. I just updated my answer with the efficient range check trick. — Peter Cordes, Sep 22 '20 at 18:56
