I'm working on a string unescaping function that converts literal sequences like \uxxxx (where xxxx is a hex value) into bytes of the corresponding value. I am planning to have the function take the first two characters of the xxxx sequence, calculate the byte value, and do the same with the second pair.
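A minimal sketch of the approach I have in mind. The names `hex_digit_value` and `unescape_u4` are hypothetical, and it assumes exactly four hex digits follow the \u prefix:

```c
#include <stddef.h>

/* Hypothetical helper: numeric value of one hex digit, or -1 if c is not one. */
static int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical helper: splits the four hex digits at xxxx into two pairs
 * and stores each pair's value as one byte in out.
 * Returns 0 on success, -1 if a non-hex character (including the
 * terminating '\0') is encountered.
 */
static int unescape_u4(const char *xxxx, unsigned char out[2])
{
    for (size_t i = 0; i < 2; i++) {
        /* Check each digit before reading the next one, so a short
           string is never read past its terminator. */
        int hi = hex_digit_value(xxxx[2 * i]);
        if (hi < 0)
            return -1;
        int lo = hex_digit_value(xxxx[2 * i + 1]);
        if (lo < 0)
            return -1;
        out[i] = (unsigned char)(hi * 16 + lo);
    }
    return 0;
}
```

With this sketch, calling `unescape_u4("0122", out)` stores 0x01 in `out[0]` and 0x22 in `out[1]`.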
But I ran into an unexpected result with literal typed UTF-8 characters. The following illustrates my issue:
#include <stdio.h>

int main(void)
{
    /* String literals have type char[], so cast explicitly to unsigned char *. */
    const unsigned char *str1 = (const unsigned char *)"abcĢ";
    const unsigned char *str2 = (const unsigned char *)"abc\x01\x22";

    for (unsigned i = 0; i < 5; i++)
        printf("String 1 character #%u: %x\n", i, str1[i]);
    for (unsigned i = 0; i < 5; i++)
        printf("String 2 character #%u: %x\n", i, str2[i]);
    return 0;
}
Output:
String 1 character #0: 61
String 1 character #1: 62
String 1 character #2: 63
String 1 character #3: c4
String 1 character #4: a2
String 2 character #0: 61
String 2 character #1: 62
String 2 character #2: 63
String 2 character #3: 1
String 2 character #4: 22
Unicode character Ģ has a hex value of \x0122, so I would expect bytes #3 and #4 to be \x01 and \x22 respectively.
Where do c4 and a2 come from? I guess I don't understand how multi-byte characters in strings are encoded in C. Any help would be appreciated.