
I have a pointer to a stream of bytes encoded in UTF8. I am trying to publish this byte stream as a JSON compatible string.

It worked fine until I hit the em dash. At this point the output of my program began to spit out garbage.

I was using snprintf to get the job done like so:

if (nUTF8CodePoints == 2)
{
    DebugLog(@"2 Unicode code points");
    snprintf( myEscapedUnicode, 8, "\\u%2x%2x",*cur,*(cur+1));
}
else if (nUTF8CodePoints == 3)
{
    DebugLog(@"3 Unicode code points");
    snprintf( myEscapedUnicode, 8, "\\u%2x%2x%2x",*cur,*(cur+1),*(cur+2));
}
else if (nUTF8CodePoints == 4)
{
    DebugLog(@"4 Unicode code points");
    snprintf( myEscapedUnicode, 8, "\\u%2x%2x%2x%2x",*cur,*(cur+1),*(cur+2),*(cur+3));
}

This code gives me \ue2809 where I expected \u2014. Now I am confused. I thought the XXXX in U+XXXX was supposed to be hex. Yet the hex representation of the bytes is giving me a different 6 digits than the 4 that are expected. How am I supposed to encode this as the expected JSON-compatible UTF-8?

Something tells me I'm close, but no cigar. For example, the utf8-chartable.de em dash entry concurs with me that there is a difference. Still, I don't quite understand what that difference is, and I am not sure how to get C to print it.

  U+2014    —   e2 80 94    EM DASH

So how do I print out these 3 bytes (e2 80 94) as U+2014? And what does the XXXX mean in this U+2014? I thought it was supposed to be hex.

ovatsug25
    UTF-8 is an encoding. You need to decode it first. See https://en.wikipedia.org/wiki/UTF-8#Encoding – Cory Nelson Jul 08 '21 at 21:10
  • @CoryNelson To clarify—I have a pointer to a stream of bytes encoded in UTF8. I am trying to publish this byte stream as a JSON compatible string. For this to be so, I need to take this 3 long UTF8 character and print it out in this format: `U+XXXX`. From my reading, the `XXXX` is supposed to be hex—but the hex representation is giving me a different 6 digits than the 4 that are expected. How am I supposed to encode this? – ovatsug25 Jul 08 '21 at 21:40
  • You need to *convert* from UTF-8 to a codepoint. You could start with the code at: https://stackoverflow.com/a/148766/5987 – Mark Ransom Jul 08 '21 at 21:49
  • Inspect the first byte to find out the length (the number of following bytes belonging to the same point) – wildplasser Jul 08 '21 at 22:04
  • @wildplasser - sorry. still confused. I already have the length in the variable `nUTF8CodePoints`. For ASCII where length is 1—I didn't include the code because I just append that to the buffer. But for a unicode character—I'm not sure what to append to the buffer. I thought the `codepoint` was supposed to be in hex but apparently it is something else. That is what I am trying to find out – ovatsug25 Jul 08 '21 at 22:07
  • What is different between a `UTF-32 (hex) 0x00002014 (2014)` and a `UTF-8 (hex) 0xE2 0x80 0x94 (e28094)` hex? Why are they both called hex when they are different? – ovatsug25 Jul 08 '21 at 22:09
  • 1
    As Steve notes below, a "stream of bytes encoded in UTF8" *is* "a JSON compatible string." I would explore the code that "began to spit out garbage." It sounds like it has a bug. Broadly speaking, UTF8 is designed so that if you don't actually need to interpret it, you can avoid decoding it at all and treat it like ASCII (as long as you're a bit careful). What you're describing doesn't sound like it requires conversion. Can you discuss the code that's actually giving you a problem? – Rob Napier Jul 08 '21 at 22:24
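What the comments are driving at: the three bytes e2 80 94 are not the code point itself, they are the UTF-8 packaging around it. A 3-byte UTF-8 sequence has the bit shape 1110xxxx 10xxxxxx 10xxxxxx, and stripping the marker bits and concatenating the sixteen x bits yields the code point. A minimal sketch of that arithmetic, hard-coding the em dash bytes from the question:

#include <stdio.h>

int main(void)
{
    /* the three UTF-8 code units for EM DASH */
    unsigned char b[] = { 0xe2, 0x80, 0x94 };

    /* 3-byte form: 1110xxxx 10xxxxxx 10xxxxxx -> 4 + 6 + 6 = 16 payload bits */
    unsigned int cp = ((b[0] & 0x0fu) << 12)
                    | ((b[1] & 0x3fu) << 6)
                    |  (b[2] & 0x3fu);

    printf("U+%04X\n", cp);   /* prints U+2014 */
    return 0;
}

Compiled and run, this prints U+2014, matching the utf8-chartable.de entry in the question.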

1 Answer


As I understand it, JSON is allowed to contain UTF-8-encoded text as-is (the exceptions being the quote, the backslash, and the control characters, which must be escaped). So to start out with, I don't think you need to treat your Unicode characters specially, or try to turn them into \uXXXX escape sequences, at all.

If you do want to emit a \uXXXX sequence, you're going to have to convert from UTF-8 back to a "pure" Unicode character (or, formally, something more like UTF-16). One way to do this -- at least, if your C library is up to it, and you've got your locale set correctly -- is with the mbtowc function. I think you should be able to use it something like this:

#include <locale.h>   /* setlocale */
#include <stdlib.h>   /* mbtowc, wchar_t */
#include <stdio.h>    /* snprintf */

setlocale(LC_CTYPE, "UTF-8");

wchar_t wc;
mbtowc(&wc, cur, nUTF8CodePoints);
snprintf(myEscapedUnicode, 8, "\\u%04x", (unsigned int)wc);
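In real code you would probably also want to check the return value: mbtowc returns the number of bytes it consumed, or -1 if the bytes do not form a valid character in the current locale. A minimal sketch:

int n = mbtowc(&wc, cur, nUTF8CodePoints);
if (n < 0) {
    /* not a valid multibyte sequence in this locale; handle the error */
}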

The only wrinkle is characters that don't fit in 16 bits, or stated another way, characters outside the Basic Multilingual Plane (BMP). Although UTF-8 can handle these just fine, if you want to escape them in JSON, the only way is as a surrogate pair of two \u sequences. (I learn this from Wikipedia; I don't claim any JSON expertise here.)

So far I've ducked this requirement in my own JSON work. I'll go out on a limb and guess it would look something like this (see this description of low and high surrogates in Wikipedia):

if(wc > 0xffff) {
    /* subtract 0x10000, then split the result into two 10-bit halves */
    unsigned int lo = (wc - 0x10000) & 0x3ff;
    unsigned int hi = ((wc - 0x10000) >> 10) & 0x3ff;
    /* "\uXXXX\uXXXX" is 12 characters, so allow 13 bytes with the NUL */
    snprintf(myEscapedUnicode, 13, "\\u%04x\\u%04x", hi + 0xD800, lo + 0xDC00);
}

Note that this is going to take more than 8 bytes in myEscapedUnicode: the two escapes are 12 characters, plus a terminating NUL, so 13 in all.
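Putting the two pieces together, a self-contained sketch might look like the following (escape_codepoint is a hypothetical helper name, and the "UTF-8" locale string follows the Mac example above):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write one code point into buf as JSON \u escapes,
   using a surrogate pair for anything outside the BMP. */
static void escape_codepoint(unsigned long cp, char *buf, size_t bufsize)
{
    if (cp > 0xffff) {
        unsigned int hi = 0xD800 + (unsigned int)(((cp - 0x10000) >> 10) & 0x3ff);
        unsigned int lo = 0xDC00 + (unsigned int)((cp - 0x10000) & 0x3ff);
        snprintf(buf, bufsize, "\\u%04x\\u%04x", hi, lo);
    } else {
        snprintf(buf, bufsize, "\\u%04x", (unsigned int)cp);
    }
}

int main(void)
{
    setlocale(LC_CTYPE, "UTF-8");

    const char *utf8 = "\xe2\x80\x94";   /* the em dash bytes e2 80 94 */
    wchar_t wc;
    char out[13];

    if (mbtowc(&wc, utf8, 3) > 0) {
        escape_codepoint((unsigned long)wc, out, sizeof out);
        printf("%s\n", out);   /* prints \u2014 */
    }
    return 0;
}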

Steve Summit
  • As a practical matter, totally agreed. You can just encode the doc in UTF-8 without requiring any escaping. RFC-8259 requires that JSON documents be UTF-8 encoded unless the documents are "exchanged between systems that are not part of a closed ecosystem." ECMA-404 just requires that it be Unicode code points without requiring a specific encoding. – Rob Napier Jul 08 '21 at 22:19
  • 1
    Technically, JSON has no concept of UTF-8, only of sequences of **Unicode codepoints**. UTF-8 is a byte encoding for Unicode, where each byte is a **code unit**, it takes 1-4 codeunits to represent 1 codepoint in UTF-8, depending on its value. UTF-8 is the *preferred* byte encoding for JSON, but it is not the only one possible. To handle this *properly*, the UTF-8 codeunits need to be *decoded* into Unicode codepoints, which can then be inserted into the JSON, and then the JSON can be encoded to a desired byte encoding. If that happens to be UTF-8, then you can skip the decode/re-encode step – Remy Lebeau Jul 08 '21 at 22:23
  • Concluding: parsing the source utf8 encoded bytestream is trivial. If the target (json) imposes restrictions (fitting in 16-bit codepoints) it can get tricky. But if json allows utf8-encoding: let it deal with it. An extra complication: you have to choose how to deal with violations. – wildplasser Jul 08 '21 at 22:23
  • Very helpful. The locale was key in getting it to work on my mac. @RemyLebeau - your comment is very helpful. Still a bit confused (end of workday) but tomorrow it should click. The locale on my mac was set to C. Which is a whole other can of worms. (setlocale(NULL) gets you the locale!) Question—is the locale PID specific or is it for the OS? – ovatsug25 Jul 08 '21 at 22:29
  • @ovatsug25 I suggest you read up on [how UTF-8 actually works](https://en.wikipedia.org/wiki/UTF-8), and read the [JSON spec](https://www.ecma-international.org/wp-content/uploads/ECMA-404_2nd_edition_december_2017.pdf) (section 9) on how the `\uXXXX` format is used to represent Unicode characters. – Remy Lebeau Jul 08 '21 at 22:32
  • @ovatsug25 Locale setting is somewhat unusual. By default you get the "C" locale, which is old and backwards-compatible and doesn't do Unicode explicitly (although it can implicitly handle UTF-8 just fine, which is one reason UTF-8 is so wonderful). A program that wants to handle Unicode explicitly, for example by calling functions like `mbtowc`, needs to call `setlocale`. Normally I call `setlocale(LC_CTYPE, "")`, which sets the locale based on environment variables. In this example I explicitly requested the locale "UTF-8", which seems to work on my Mac, but I doubt it's universal. – Steve Summit Jul 08 '21 at 22:33
  • 1
    There are probably best practices on setting the locale, that are documented somewhere, that I don't know about, so if this is a production program you're working on you'll want to research that, too. (There are also variants like `mbtowc_l` that let you pass in a locale specifier, rather than setting it globally with `setlocale`.) – Steve Summit Jul 08 '21 at 22:35
  • 1
    @ovatsug25 But in any case, when you call `setlocale` it's just for your process, not the whole system. – Steve Summit Jul 08 '21 at 22:36
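For the locale question at the end: a quick way to see what setlocale is doing is to query it (the query form takes NULL as the second argument; the names printed will vary from system to system):

#include <locale.h>
#include <stdio.h>

int main(void)
{
    /* passing NULL only queries; "" adopts the locale from the environment */
    printf("before: %s\n", setlocale(LC_CTYPE, NULL));   /* typically "C" */
    setlocale(LC_CTYPE, "");
    printf("after:  %s\n", setlocale(LC_CTYPE, NULL));   /* e.g. "en_US.UTF-8" */
    return 0;
}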