I have a pointer to a stream of bytes encoded in UTF-8. I am trying to publish this byte stream as a JSON-compatible string.
It worked fine until I hit the em dash, at which point my program started printing garbage.
I was using snprintf to get the job done, like so:
if (nUTF8CodePoints == 2)
{
    DebugLog(@"2 Unicode code points");
    snprintf( myEscapedUnicode, 8, "\\u%2x%2x",*cur,*(cur+1));
}
else if (nUTF8CodePoints == 3)
{
    DebugLog(@"3 Unicode code points");
    snprintf( myEscapedUnicode, 8, "\\u%2x%2x%2x",*cur,*(cur+1),*(cur+2));
}
else if (nUTF8CodePoints == 4)
{
    DebugLog(@"4 Unicode code points");
    snprintf( myEscapedUnicode, 8, "\\u%2x%2x%2x%2x",*cur,*(cur+1),*(cur+2),*(cur+3));
}
This code gives me \ue2809 where I expected U+2014. Now I am confused. I thought the XXXX in U+XXXX was supposed to be hex. Yet the hex representation is giving me six digits that differ from the four I expected. How am I supposed to encode this to the expected JSON-compatible UTF-8?
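Part of the mismatch is plain truncation: the three bytes need eight characters (\ue28094), but a buffer size of 8 leaves room for only seven plus the terminating NUL. A minimal sketch that reproduces this in isolation, assuming the em dash bytes e2 80 94 and the same format string as the 3-byte branch; `format_raw_bytes` is a hypothetical helper name, not part of my program:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: formats three raw bytes with the same format string
   as the 3-byte branch of the snippet. Returns snprintf's result, which is
   the number of characters the full output would need, excluding the NUL. */
int format_raw_bytes(char *out, size_t out_size, const unsigned char *cur)
{
    assert(out != NULL && cur != NULL);
    /* Note: %2x pads with spaces, while %02x would pad with zeros. */
    return snprintf(out, out_size, "\\u%2x%2x%2x",
                    (unsigned)cur[0], (unsigned)cur[1], (unsigned)cur[2]);
}
```

With an 8-byte buffer and the bytes e2 80 94, the return value should be 8 while the buffer holds only \ue2809, so comparing the return value against the buffer size is one way to detect the cut-off.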
Something tells me I'm close, but no cigar. For example, the utf8-chartable.de em dash entry concurs with me that there is a difference. Still, I don't yet quite understand what it is, and I am not sure how to get C to print it.
U+2014 — e2 80 94 EM DASH
So how do I print out these 3 bytes (e2 80 94) as U+2014? And what does the XXXX mean in this U+2014? I thought it was supposed to be hex.
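One way to read the table row: XXXX is hex, but it is the hex value of the code point, not of the UTF-8 bytes. In a 3-byte sequence the first byte carries 4 payload bits and each following byte carries 6, so e2 80 94 unpacks to 0x2014. A minimal sketch of that conversion, assuming the input is already valid UTF-8 and the byte count of the sequence is known; `decode_utf8` and `json_escape` are hypothetical names, not library functions:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: decodes one UTF-8 sequence of n bytes (2 to 4)
   starting at cur into a code point. Assumes already-validated input. */
uint32_t decode_utf8(const unsigned char *cur, int n)
{
    assert(cur != NULL && n >= 2 && n <= 4);
    /* The lead byte keeps 5, 4 or 3 payload bits for n = 2, 3 or 4. */
    uint32_t cp = cur[0] & (0xFFu >> (n + 1));
    for (int i = 1; i < n; i++)
        cp = (cp << 6) | (cur[i] & 0x3Fu);  /* continuation byte: 6 bits */
    return cp;
}

/* Hypothetical helper: writes the JSON escape for cp into out and returns
   snprintf's result. A 13-byte buffer fits two escapes plus the NUL. */
int json_escape(char *out, size_t out_size, uint32_t cp)
{
    assert(out != NULL);
    if (cp < 0x10000u)
        return snprintf(out, out_size, "\\u%04x", (unsigned)cp);
    /* JSON writes code points above U+FFFF as a UTF-16 surrogate pair. */
    cp -= 0x10000u;
    return snprintf(out, out_size, "\\u%04x\\u%04x",
                    (unsigned)(0xD800u + (cp >> 10)),
                    (unsigned)(0xDC00u + (cp & 0x3FFu)));
}
```

Calling `json_escape(buf, sizeof buf, decode_utf8(cur, 3))` with `cur` pointing at e2 80 94 should leave \u2014 in `buf`. The surrogate-pair branch is there because a 4-byte sequence does not fit in four hex digits.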