does that mean that all characters are basically just hexadecimal versions/ascii values when in registers
Yep, just like in C, a string is just an array of char elements. char is just a narrow integer type, like uint8_t. (The C standard doesn't define whether char is signed or not, so use uint8_t if you're actually using C and want unsigned variables.)
Anyway, yes, in asm you should treat characters in an ASCII string as 8-bit integers, where the value is their ASCII encoding. e.g. if al has a value from [0..9], you can convert it to the corresponding ASCII decimal digit with:
add al, '0' ; '0' is a nice way to write 0x30
Converting larger integers to/from multi-digit decimal strings is obviously more complicated, and requires dividing or multiplying by 10.
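As a rough sketch of the divide-by-10 approach, here is a C version rather than asm. The function name u32_to_decimal and the buffer convention are made up for illustration. Each iteration peels off the low digit with a remainder and adds '0', exactly like the single-digit case above.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: writes the decimal digits of value into buf
// (buf must have room for 10 digits plus a terminating 0)
// and returns the number of digits written.
size_t u32_to_decimal(uint32_t value, char *buf)
{
    char tmp[10];               // 2^32-1 has 10 decimal digits
    size_t n = 0;

    // Peel off the least-significant digit on each iteration.
    do {
        tmp[n++] = (char)('0' + value % 10);  // same idea as add al, '0'
        value /= 10;
    } while (value != 0);

    // The digits came out in reverse order, so copy them back-to-front.
    for (size_t i = 0; i < n; i++)
        buf[i] = tmp[n - 1 - i];
    buf[n] = '\0';
    return n;
}
```

In asm the same loop would typically use div (or a multiplicative inverse) to get the quotient and remainder, storing digits from the end of a buffer backwards to avoid the final reversal.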
If you have an alphabetic ASCII character in a register, you can force it to lower case, upper case, or flip its case with
or al, 0x20 ; tolower
and al, ~0x20 ; toupper
xor al, 0x20 ; opposite
'a' - 'A' is 0x20, and the ranges don't cross a 0x20 boundary within the coding space (i.e. all lowercase letters have that bit set, and none of the uppercase letters do). To check if an ASCII character is a letter or not, see my answer on a question about case-flipping for letters only.
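One common way to do the "is it a letter?" check, sketched in C (the function names are made up for illustration, and this isn't necessarily identical to the code in that answer): force the character to lowercase with the 0x20 bit, then do a single unsigned range check against 'a'..'z'.

```c
#include <stdint.h>

// Returns nonzero if c is an ASCII letter (A-Z or a-z).
int is_ascii_alpha(uint8_t c)
{
    // OR with 0x20 maps 'A'..'Z' onto 'a'..'z'. After subtracting 'a',
    // anything below 'a' wraps to a large unsigned value and fails the compare.
    return (uint8_t)((c | 0x20) - 'a') <= 'z' - 'a';
}

// Flips the case of letters and leaves every other byte untouched.
uint8_t flip_case_if_alpha(uint8_t c)
{
    return is_ascii_alpha(c) ? (uint8_t)(c ^ 0x20) : c;
}
```

The unsigned-wraparound trick turns two compares (>= 'a' and <= 'z') into one sub plus one cmp / jbe-style branch in asm.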
If I put "I like pie" in a register...
Processing multiple characters at a time in a register is tricky, but essential for high performance. It's easier to write code that just keeps a pointer to a position in a string, and processes a byte at a time.
NASM and MASM have syntax for multi-character constants, so you can do stuff like
mov eax, 'abcd' ; put those bytes into eax, like if you'd loaded from db 'a', 'b', 'c', 'd'
Unless you know the string length at assemble time, you won't know how many registers it will take to hold the whole thing. Normally you'd load 4 or 8 chars at a time into a register and use some SWAR bithack to do something like check if any of the bytes are zero, if you were implementing strlen() for example.
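For illustration, here's the well-known "does this word contain a zero byte?" bithack in C. The function names are hypothetical, and a real strlen() would also have to worry about alignment and not reading past the end of a page.

```c
#include <stdint.h>
#include <string.h>

// Nonzero if any of the 4 bytes in v is zero.
// Subtracting 1 from each byte sets its high bit via borrow only when the
// byte was 0 (or was >= 0x81, which the & ~v term filters out).
int has_zero_byte(uint32_t v)
{
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}

// Load 4 chars into an integer, like mov eax, [rsi].
// memcpy avoids alignment and strict-aliasing problems in C.
uint32_t load_u32(const char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

A word-at-a-time loop would call has_zero_byte(load_u32(p)) every 4 bytes, and only fall back to a byte-at-a-time check once it reports a hit.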
If you know all the chars are ASCII letters, you can force them all to lowercase with
mov eax, [rsi]
or eax, 0x20202020 ; packed tolower on all 4 chars
Your loop
You're trying to implement strchr(const char *s, int c)?
getI:
cmp BYTE PTR [edx],48h ; so far so good
je L1 ; goes to L1 whether the branch is taken or not
L1:
add [edx],1 ; increments the data pointed to by edx, not edx itself.
loop getI ; If ECX is the string length, this is ok. Also, the loop instruction is slow; avoid it.
Also, add [edx],1 won't assemble because the operand-size is ambiguous. It could be add byte [edx], 1, add word [edx], 1, or add dword [edx], 1. You probably meant inc edx.
If you know the character is present, you don't need to check the length (or for a terminating zero byte in a C-style implicit-length string).
e.g.
unsafe_strchr:
; ... get the char you want in al, and the pointer in esi
.search:
inc esi
cmp byte [esi-1], al
jne .search
; esi now points one byte past the match (the matching byte is at [esi-1])
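In C terms, the difference between the "known to be present" search and one that also checks for the terminating zero looks roughly like this. Both function names are made up for illustration; the second behaves like the standard strchr for plain char arguments.

```c
#include <stddef.h>

// Caller must guarantee that c occurs somewhere in s,
// otherwise this runs off the end of the string.
const char *unsafe_strchr(const char *s, char c)
{
    while (*s != c)
        s++;
    return s;           // points at the match itself
}

// Checked variant: also stops at the terminating 0 byte.
const char *checked_strchr(const char *s, char c)
{
    for (;; s++) {
        if (*s == c)
            return s;   // also finds the terminator when c == 0
        if (*s == '\0')
            return NULL;
    }
}
```

The checked version needs a second compare-and-branch per byte, which is exactly the cost the unsafe asm loop above avoids.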