
I'm working on a string unescaping function that converts literal sequences like \uxxxx (where xxxx is a hex value) into bytes of the corresponding value. I am planning to have the function take the first two characters of the xxxx sequence, calculate the byte value, and do the same with the second pair.

But I ran into an unexpected result with UTF-8 characters typed literally into the string. The following illustrates my issue:

#include <stdio.h>

int main()
{
    unsigned char *str1 = "abcĢ";
    unsigned char *str2 = "abc\x01\x22";
    for (unsigned i = 0; i < 5; i++)
        printf ("String 1 character #%u: %x\n", i, str1[i]);
    for (unsigned i = 0; i < 5; i++)
        printf ("String 2 character #%u: %x\n", i, str2[i]);

    return 0;
}

Output:

String 1 character #0: 61
String 1 character #1: 62
String 1 character #2: 63
String 1 character #3: c4
String 1 character #4: a2
String 2 character #0: 61
String 2 character #1: 62
String 2 character #2: 63
String 2 character #3: 1
String 2 character #4: 22

Unicode character Ģ has the hex value 0x0122, so I would expect bytes #3 and #4 to be \x01 and \x22 respectively.

Where do c4 and a2 come from? I guess I am not understanding how multi-byte characters in strings are encoded in C. Any help would be appreciated.

user3758232
  • `c4`,`a2` is the UTF-8 byte sequence for `Ģ` (U+0122) _Latin Capital Letter G With Cedilla_ – JosefZ Jan 15 '21 at 16:51
  • Then my question is, what is the logic to convert `c4` `a2` into `0122` and back? – user3758232 Jan 15 '21 at 16:55
  • And, of course, if someone knows of a ready-made function that converts a literal `"\\u0122"` into `"\u0122"` that would resolve my problem very nicely... Not trying to reinvent the wheel. – user3758232 Jan 15 '21 at 17:01
  • UTF8 is not only the ordinal of the character, it also encodes the number of bytes that is needed to represent this character. – Matheus Rossi Saciotto Jan 15 '21 at 17:18
  • Your compiler might support `"\u0122"` as a 4-digit Unicode escape sequence; doubling up the backslash is not productive unless your compiler doesn't support the Unicode escape sequences, but if it doesn't support them, using the notation doesn't help. For a start byte in the range 0xC2..0xDF (a 2-byte UTF8 character), you need `codepoint = ((utf8[0] & 0x1F) << 6) + (utf8[1] & 0x3F);`. That assumes that `utf8[1]` is in the range 0x80..0xBF. – Jonathan Leffler Jan 15 '21 at 17:18
  • I guess that the conversion is not trivial but this snippet uncovers the logic: https://gist.github.com/MightyPork/52eda3e5677b4b03524e40c9f0ab1da5 – user3758232 Jan 15 '21 at 17:23
  • @JonathanLeffler I am not in control of the double backslash. It comes from user input containing characters with a specific escape sequence that I have to convert to C-native. – user3758232 Jan 15 '21 at 17:26
  • OK; then you'll need to write code that demands a backslash, a lower-case U, and 4 hex-digits (either case), and you'll need to convert the 4 hex digits into a 16-bit unsigned integer, or some larger type (32-bit `int`, optionally `unsigned`). Not all that hard. Note that there must be 4 hex digits; it is a hard requirement; three or fewer is an invalid sequence with no unambiguous interpretation (error). Extra hex digits are irrelevant; they're outside the scope of the `\u` escape sequence (but must not be converted; so `strtol()` is not usable unless you extract the 4 digits into a string). – Jonathan Leffler Jan 15 '21 at 17:32
  • Note when using non-ASCII characters in a `char` string, the bytes that end up in the string depends on the encoding used for the source file. In this case, your source file must be saved in UTF-8 encoding. – Mark Tolonen Jan 15 '21 at 17:39
  • So you want to convert a user input sequence containing a mixture of UTF-8 and C style backslash escape sequences into a UTF-8 output sequence? – Ian Abbott Jan 15 '21 at 17:42
  • @IanAbbott The input is a file handle. I would encounter custom escape sequences representing UC code points as the one in my other comment. – user3758232 Jan 15 '21 at 18:11
  • https://en.wikipedia.org/wiki/UTF-8#Encoding – monkeyman79 Jan 15 '21 at 18:15
  • Here's some code to convert Unicode code points to UTF-8: https://stackoverflow.com/a/148766/5987. It's in C++ but wouldn't be hard to convert, especially if you're going one character at a time. – Mark Ransom Jan 15 '21 at 18:21
  • [UTF-8 to UTF-16 one-way conversion, written in C](https://gist.github.com/tommai78101/3631ed1f136b78238e85582f08bdc618) – JosefZ Jan 15 '21 at 19:31
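
For the decode direction asked about in the comments above (getting from the bytes `c4` `a2` back to `0122`), here is a minimal sketch of the 2-byte formula from Jonathan Leffler's comment, assuming the input is already known to be a valid 2-byte UTF-8 sequence (no validation shown):

#include <stdio.h>

int main(void)
{
    const unsigned char utf8[] = {0xC4, 0xA2};   /* Ģ, as stored in str1 above */
    /* the lead byte contributes 5 payload bits, the continuation byte 6 */
    unsigned codepoint = ((utf8[0] & 0x1F) << 6) + (utf8[1] & 0x3F);
    printf("U+%04X\n", codepoint);               /* prints: U+0122 */
    return 0;
}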

2 Answers


Unicode character Ģ has the hex value 0x0122, so I would expect bytes #3 and #4 to be \x01 and \x22 respectively.

Where do c4 and a2 come from?

In Unicode, Ģ is codepoint U+0122 LATIN CAPITAL LETTER G WITH CEDILLA, which in UTF-8 is encoded as bytes 0xC4 0xA2.

Either your source file is saved as UTF-8, or your compiler is configured to save string literals in UTF-8. Either way, in your str1 string, the literal Ģ is stored as UTF-8. Thus:

unsigned char *str1 = "abcĢ";

is roughly equivalent to this:

unsigned char literal[] = {'a', 'b', 'c', 0xC4, 0xA2, '\0'};
unsigned char *str1 = &literal[0];
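
To see concretely where those two bytes come from, here is the standard 2-byte UTF-8 arithmetic (110xxxxx 10xxxxxx) applied to U+0122, as a quick sanity check rather than a full converter:

#include <stdio.h>

int main(void)
{
    unsigned cp = 0x0122;                        /* Ģ */
    unsigned lead = 0xC0 | ((cp >> 6) & 0x1F);   /* top 5 of the 11 payload bits */
    unsigned cont = 0x80 | (cp & 0x3F);          /* low 6 payload bits */
    printf("%02X %02X\n", lead, cont);           /* prints: C4 A2 */
    return 0;
}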

In an escape sequence, the entire sequence represents a single numeric value. So, \x01 and \x22 represent the individual numeric values 0x01 hex (1 decimal) and 0x22 hex (34 decimal), respectively. Thus:

unsigned char *str2 = "abc\x01\x22";

is roughly equivalent to this:

unsigned char literal[] = {'a', 'b', 'c', 0x01, 0x22, '\0'};
unsigned char *str2 = &literal[0];

You are simply outputting the raw bytes of the strings that str1 and str2 are pointing at.

The escape sequence \u0122 represents the numeric value 0x0122 hex (290 decimal), which in Unicode is codepoint U+0122, hence C4 A2 in UTF-8. So, if you have an input string like this:

const char *str = "abc\\u0122"; // {'a', 'b', 'c', '\\', 'u', '0', '1', '2', '2', '\0'}

And you want to decode it to UTF-8, you would need to detect the "\u" prefix, extract the following "0122" substring, parse it as a hex number into an integer, interpret that integer as a Unicode codepoint, and convert it to UTF-8 (a, b, and c are already valid chars as-is in UTF-8).
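
As a rough illustration of those steps, here is a minimal sketch; the helper name parse_u_escape is made up for this example, and it handles nothing beyond the exact 4-hex-digit \u form:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative helper (not a library function): reads one "\uXXXX"
   sequence at *p, advances *p past it, and returns the codepoint,
   or -1 if the sequence is malformed (exactly 4 hex digits required). */
static long parse_u_escape(const char **p)
{
    const char *s = *p;
    char hex[5] = {0};
    if (s[0] != '\\' || s[1] != 'u')
        return -1;
    for (int i = 0; i < 4; i++)
    {
        if (!isxdigit((unsigned char)s[2 + i]))
            return -1;
        hex[i] = s[2 + i];
    }
    *p = s + 6;
    return strtol(hex, NULL, 16);  /* safe here: the 4 digits were copied out first */
}

int main(void)
{
    const char *p = "\\u0122";           /* the escaped input from above */
    long cp = parse_u_escape(&p);
    printf("codepoint: U+%04lX\n", cp);  /* prints: codepoint: U+0122 */
    return 0;
}

The resulting codepoint can then be handed to a codepoint-to-UTF-8 converter like the one in the other answer below.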

Remy Lebeau

UTF-8 can't simply break a large value into its individual bytes, because the result would be ambiguous. How would you tell the difference between "\u4142" (䅂) and the two-character string "AB"?

The rules for producing a UTF-8 byte string from a Unicode code point number are quite simple, and eliminate the ambiguity. Given any sequence of byte values, it either defines unambiguous codepoints or it's an invalid sequence.

Here's a simple function that will convert a single Unicode codepoint value to a UTF-8 byte sequence.

void codepoint_to_UTF8(int codepoint, char *out)
/* out must point to a buffer of at least 5 chars. */
{
    if (codepoint <= 0x7f)
        /* 1 byte: 0xxxxxxx (plain ASCII) */
        *out++ = (char)codepoint;
    else if (codepoint <= 0x7ff)
    {
        /* 2 bytes: 110xxxxx 10xxxxxx */
        *out++ = (char)(0xc0 | ((codepoint >> 6) & 0x1f));
        *out++ = (char)(0x80 | (codepoint & 0x3f));
    }
    else if (codepoint <= 0xffff)
    {
        /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        *out++ = (char)(0xe0 | ((codepoint >> 12) & 0x0f));
        *out++ = (char)(0x80 | ((codepoint >> 6) & 0x3f));
        *out++ = (char)(0x80 | (codepoint & 0x3f));
    }
    else
    {
        /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        *out++ = (char)(0xf0 | ((codepoint >> 18) & 0x07));
        *out++ = (char)(0x80 | ((codepoint >> 12) & 0x3f));
        *out++ = (char)(0x80 | ((codepoint >> 6) & 0x3f));
        *out++ = (char)(0x80 | (codepoint & 0x3f));
    }
    *out = 0;   /* NUL-terminate */
}

Note that this function does no error checking, so if you give it an input outside the valid Unicode range of 0 to 0x10ffff it will generate an incorrect (but still valid) UTF-8 sequence.
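
For illustration, here is a minimal usage sketch, assuming the codepoint_to_UTF8 function above is defined earlier in the same file:

#include <stdio.h>

int main(void)
{
    char buf[5];
    codepoint_to_UTF8(0x0122, buf);            /* Ģ */
    for (const char *p = buf; *p; p++)
        printf("%02x ", (unsigned char)*p);    /* prints: c4 a2 */
    printf("\n");
    return 0;
}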

Mark Ransom