
I'm confused about how characters work in assembly. Since everything is stored as bits, does that mean that all characters are basically just their hexadecimal/ASCII values when in registers?

For example if I wanted to put "I like pie" in a register will it show up as "48h 4C484B45h 504945h"? And if I wanted to get all "i" letters will I need to use a command like

;   "command to get string input"  
getI:
    cmp BYTE PTR [edx],48h 
    je L1
L1: 
    add [edx],1
    loop getI  

I'm basically trying to find a way to isolate a character from an inputted string.

Peter Cordes
mend0k
  • Note, spaces are also characters. In ASCII encoding they have the ASCII code 32 (0x20). Also, you typically never place the actual string _contents_ in a register. In your example it certainly wouldn't work, because `"I like pie"` is 10 characters (i.e. 10 bytes), which is too large to fit in a register (if we discount the SSE/AVX registers). So what you do is to have the string data in the data section, or on the heap or stack if it's a dynamically allocated string, and what you place in the register is the _address of_ that string data. – Michael Mar 31 '16 at 05:40
  • Also, 48h in ASCII is `'H'`. `'i'` is 69h, and `'I'` is 49h. – Michael Mar 31 '16 at 05:43

2 Answers


Your approach was close to working:

getI:
    cmp BYTE PTR [edx],'i'  ; your assembler will know what that is
    jne L1                  ; skip the code for 'i' handling, if it's NOT an 'i'
    ; here you do what
    ; ever you want to 
    ; with your 'i's here
L1: 
    add edx,1               ; better: 'inc edx'
    loop getI               ; loop only works if ECX holds the length of your "string";
                            ; either set that up, or check for a 0 byte INSIDE the loop
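For readers who find the control flow easier to follow in C, here is a rough sketch of the same loop. The function name `count_char` and the choice to count the matches (rather than do something else with them) are assumptions made for illustration; they are not part of the original code.

```c
#include <stddef.h>

/* Count how many bytes in buf[0..len) equal `target`.
   Hypothetical helper: it mirrors the asm loop one step at a time. */
size_t count_char(const char *buf, size_t len, char target)
{
    size_t count = 0;
    const char *p = buf;          /* plays the role of edx */

    while (len != 0) {            /* plays the role of ecx with `loop` */
        if (*p == target) {       /* cmp BYTE PTR [edx], 'i' */
            count++;              /* the "do whatever you want" part */
        }
        p++;                      /* add edx, 1 */
        len--;
    }
    return count;
}
```

One difference worth noting: this version checks the length before the first iteration, so a length of 0 is handled, whereas `loop` decrements ECX first and would wrap around if ECX started at 0.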

Your CPU doesn't care about your "string"; it's just a sequence of bytes in memory.
"i like pie" is stored as the ASCII bytes 69h,20h,6ch,69h,6bh,65h,20h,70h,69h,65h,0 (the final 0 being a C-style terminator). When you load one of these into a register, it's just a byte value. There's nothing special about strings or chars.
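To make the "a string is just bytes" point concrete, here is a small C sketch that compares the string literal against the byte values listed above. The names `pie_bytes` and `pie_bytes_match` are made up for this example, and it assumes an ASCII-based platform.

```c
#include <string.h>

/* The byte values listed above, written out as plain integers. */
static const unsigned char pie_bytes[] = {
    0x69, 0x20, 0x6c, 0x69, 0x6b, 0x65, 0x20, 0x70, 0x69, 0x65, 0x00
};

/* Returns 1 if the string literal and the integer array hold identical
   bytes (assumes the compiler uses ASCII for character literals). */
int pie_bytes_match(void)
{
    const char text[] = "i like pie";   /* 10 chars + terminating 0 */
    return sizeof text == sizeof pie_bytes
        && memcmp(text, pie_bytes, sizeof pie_bytes) == 0;
}
```

As far as memory is concerned, the quoted string and the list of numbers are two spellings of the same 11 bytes.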

Btw: not all registers are equal; some have special purposes. You've chosen edx as your "string pointer". x86 CPUs have index registers (esi and edi) with specialized string instructions that use them, which would have served you better. But that's a different story ;-)

Tommylee2k
  • Thanks a lot, I understand it better now, but the "add [edx],1" should've been "add edx,1". So if it's in single quotes it refers to the ASCII table? Say I put a '1' instead of 'i', will it look for the ASCII version of 1 instead of the numerical value? – mend0k Apr 02 '16 at 04:31
  • You're right about edx, I'll fix it :) (didn't check what I copied from you :P). Again: your CPU doesn't care about ASCII. So if you want to check for ASCII '1', compare against '1' (e.g. `cmp al, '1'`); a check for binary 1 is e.g. `cmp al, 1`. – Tommylee2k Apr 02 '16 at 07:19

does that mean that all characters are basically just hexadecimal versions/ascii values when in registers

Yep, just like in C, a string is just an array of char elements. char is just a narrow integer type, like uint8_t. (The C standard doesn't define whether char is signed or not, so use uint8_t if you're actually using C and want unsigned variables.)

Anyway, yes, in asm you should treat characters in an ASCII string as 8-bit integers whose value is their ASCII encoding. E.g. if al holds a value in [0..9], you can convert it to the corresponding decimal digit character with:

add   al, '0'    ; '0' is a nice way to write 0x30

Converting larger integers to/from multi-digit decimal strings is obviously more complicated, and requires dividing or multiplying by 10.
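As a rough illustration of the divide-by-10 approach, here is a C sketch. The function name `u32_to_decimal` and the fixed 11-byte output buffer are assumptions for this example; an asm version would typically follow the same steps with `div`.

```c
#include <stddef.h>
#include <stdint.h>

/* Write `value` as a NUL-terminated decimal string into `out` and return
   the number of digits. `out` needs room for 11 bytes: up to 10 digits
   for a 32-bit value, plus the terminator. */
size_t u32_to_decimal(uint32_t value, char out[11])
{
    char tmp[10];
    size_t n = 0;

    /* Repeated division yields the digits least-significant first. */
    do {
        tmp[n++] = (char)('0' + value % 10);   /* same idea as add al, '0' */
        value /= 10;
    } while (value != 0);

    /* Reverse them into the output buffer. */
    for (size_t i = 0; i < n; i++) {
        out[i] = tmp[n - 1 - i];
    }
    out[n] = '\0';
    return n;
}
```

The do/while form makes sure that 0 still produces the single digit "0".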


If you have an alphabetic ASCII character in a register, you can force it to lower case, force it to upper case, or flip its case with

or    al, 0x20    ; tolower
and   al, ~0x20   ; toupper
xor   al, 0x20    ; opposite

'a' - 'A' is 0x20, and neither range crosses a 0x20 boundary within the coding space (i.e. all lowercase letters have that bit set, and none of the uppercase letters do). To check whether an ASCII character is a letter or not, see my answer on a question about case-flipping for letters only.
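One common way to do the "letters only" check is an unsigned range compare after forcing the 0x20 bit on. The C sketch below shows that idea; the helper names are hypothetical, and this is not necessarily the exact method used in the linked answer.

```c
/* Nonzero if c is an ASCII letter. Setting bit 0x20 maps 'A'..'Z' onto
   'a'..'z'; a single unsigned compare then covers both ranges. */
int is_ascii_alpha(unsigned char c)
{
    unsigned char lowered = (unsigned char)(c | 0x20);     /* or al, 0x20 */
    return (unsigned char)(lowered - 'a') <= 'z' - 'a';    /* wraps if below 'a' */
}

/* Flip the case of letters and leave every other byte unchanged. */
unsigned char flip_case(unsigned char c)
{
    return is_ascii_alpha(c) ? (unsigned char)(c ^ 0x20) : c;  /* xor al, 0x20 */
}
```

Without the check, `xor al, 0x20` would also mangle non-letters, e.g. turning '@' (0x40) into '`' (0x60).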


If I put "I like pie" in a register...

Processing multiple characters at a time in a register is tricky, but essential for high performance. It's easier to write code that just keeps a pointer to a position in a string, and processes a byte at a time.

NASM and MASM have syntax for multi-character constants, so you can do stuff like

mov   eax, 'abcd'    ; in NASM, eax gets the same value as a dword load from  db 'a', 'b', 'c', 'd'  (byte order can differ in other assemblers)

Unless you know the string length at assemble time, you won't know how many registers it will take to hold the whole thing. Normally you'd load 4 or 8 chars at a time into a register and use some SWAR bithack to do something like check if any of the bytes are zero, if you were implementing strlen() for example.
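For reference, the well-known "has a zero byte" bithack looks roughly like this in C. The helper names `has_zero_byte` and `load4` are made up for the sketch, and a real strlen would also have to deal with alignment and with not reading past the end of the buffer.

```c
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the 4 bytes in v is zero.
   Subtracting 1 from each byte borrows into that byte's high bit only if
   the byte was 0 (or its high bit was already set, which `& ~v` masks out). */
int has_zero_byte(uint32_t v)
{
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}

/* Load 4 chars into an integer, like `mov eax, [esi]`. memcpy avoids
   alignment and aliasing problems in C. */
uint32_t load4(const char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

For example, `has_zero_byte(load4("abc"))` is nonzero because the terminator is the 4th byte, while `has_zero_byte(load4("abcd"))` is 0.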

If you know all the chars are ASCII letters, you can force them all to lowercase with

mov  eax, [rsi]
or   eax, 0x20202020    ; packed tolower on all 4 chars
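The same packed trick can be written in C, which makes the precondition easy to try out. `tolower4` is a hypothetical name for this sketch, and it assumes the buffer holds at least 4 bytes.

```c
#include <stdint.h>
#include <string.h>

/* Force 4 consecutive ASCII letters to lowercase with one OR.
   Only meaningful if all 4 bytes are letters: other bytes may change. */
void tolower4(char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);     /* mov eax, [rsi] */
    v |= 0x20202020u;            /* or  eax, 0x20202020 */
    memcpy(p, &v, sizeof v);     /* store the 4 bytes back */
}
```

Because the constant has the same value in every byte, byte order does not matter here. Feeding it a non-letter shows why the "all letters" assumption is needed: '@' (0x40) comes out as '`' (0x60).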

Your loop

You're trying to implement strchr(const char *s, int c)?

getI:
    cmp BYTE PTR [edx],48h      ; so far so good
    je L1                       ; goes to L1 whether the branch is taken or not
L1: 
    add [edx],1                 ; increments the data pointed to by edx, not edx itself.
    loop getI                   ; If ECX is the string length, this is ok.  Also, loop is slow; avoid it

Also, add [edx],1 won't assemble because the operand-size is ambiguous. It could be add byte [edx], 1, add word [edx], 1, or add dword [edx], 1. You probably meant inc edx.

If you know the character is present, you don't need to check the length (or for a terminating zero byte in a C-style implicit-length string).

e.g.

unsafe_strchr:
   ; ... get the char you want in al, and the pointer in esi
.search:
    inc    esi
    cmp    byte [esi-1], al
    jne  .search
; esi now points one byte past the match; the matching byte is at [esi-1]
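If you can't assume the character is present, the loop also has to check for the terminator. Here is a C sketch of that safer variant; `find_char` is a hypothetical name, and it is meant to behave like strchr rather than to be a drop-in replacement for it.

```c
#include <stddef.h>

/* Return a pointer to the first occurrence of c in the NUL-terminated
   string s, or NULL if the terminator is reached first. */
const char *find_char(const char *s, char c)
{
    for (;; s++) {
        if (*s == c) {
            return s;        /* found (also matches c == '\0', like strchr) */
        }
        if (*s == '\0') {
            return NULL;     /* hit the end without finding c */
        }
    }
}
```

Unlike the asm above, this returns a pointer to the match itself rather than one byte past it.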
Peter Cordes