
I have to go through some text and write the UTF-8 output according to the character patterns. I thought it’ll be easy if I can work with the code points and get it converted to UTF-8. I have been reading about Unicode and UTF-8, but couldn’t find a good solution. Any help will be appreciated.

ib.
chanux
  • Does this answer your question? [UTF conversion functions in C++11](https://stackoverflow.com/questions/38688417/utf-conversion-functions-in-c11) – Dúthomhas Jan 18 '22 at 20:44

6 Answers


Converting Unicode code points to UTF-8 is so trivial that making the call to a library probably takes more code than just doing it yourself:

/* c is the input code point; b is an unsigned char * positioned at the
   output buffer (caller ensures at least 4 bytes are available). */
if (c<0x80) *b++=c;
else if (c<0x800) *b++=192+c/64, *b++=128+c%64;
else if (c-0xd800u<0x800) goto error;
else if (c<0x10000) *b++=224+c/4096, *b++=128+c/64%64, *b++=128+c%64;
else if (c<0x110000) *b++=240+c/262144, *b++=128+c/4096%64, *b++=128+c/64%64, *b++=128+c%64;
else goto error;

Also, doing it yourself means you can tune the API to the kind of work you need (character-at-a-time, or long strings?). You can also remove the error cases if you know your input is a valid Unicode scalar value.
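Wrapped into a callable form (the function name and return convention are mine, not part of the answer, with `return 0` standing in for the `goto error` cases), the same branches look like:

```c
#include <stddef.h>

/* Same logic as the snippet above, wrapped so it can be called and
 * tested. Writes up to 4 bytes into buf and returns the count, or
 * 0 on error (surrogate, or value above 0x10FFFF). */
static size_t utf8_encode(unsigned long c, unsigned char *buf)
{
    unsigned char *b = buf;
    if (c<0x80) *b++=c;
    else if (c<0x800) *b++=192+c/64, *b++=128+c%64;
    else if (c-0xd800u<0x800) return 0;
    else if (c<0x10000) *b++=224+c/4096, *b++=128+c/64%64, *b++=128+c%64;
    else if (c<0x110000) *b++=240+c/262144, *b++=128+c/4096%64, *b++=128+c/64%64, *b++=128+c%64;
    else return 0;
    return (size_t)(b - buf);
}
```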

The other direction is a good bit harder to get correct. I recommend a finite automaton approach rather than the typical bit-arithmetic loops that sometimes decode invalid sequences as aliases for real characters (which is very dangerous and can lead to security problems).
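For illustration, a validating decoder in that spirit might look like this (a hand-rolled sketch, not from the answer; it uses a shift-and-accumulate loop rather than a table-driven automaton, but it rejects overlongs, surrogates, truncated sequences, and out-of-range values instead of aliasing them):

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available) into *cp.
 * Returns the number of bytes consumed, or 0 on an invalid sequence. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    if (n == 0) return 0;
    uint32_t c = s[0], min;
    size_t len, i;
    if (c < 0x80)      { *cp = c; return 1; }
    else if (c < 0xC0) return 0;                  /* stray continuation byte */
    else if (c < 0xE0) { len = 2; c &= 0x1F; min = 0x80; }
    else if (c < 0xF0) { len = 3; c &= 0x0F; min = 0x800; }
    else if (c < 0xF5) { len = 4; c &= 0x07; min = 0x10000; }
    else return 0;                                /* 0xF5..0xFF never valid */
    if (n < len) return 0;                        /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;      /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min) return 0;                        /* overlong encoding */
    if (c - 0xD800 < 0x800 || c > 0x10FFFF) return 0;
    *cp = c;
    return len;
}
```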

Even if you do end up going with a library, I think you should either try writing it yourself first or at least seriously study the UTF-8 specification before going further. A lot of bad design can come from treating UTF-8 as a black box when the whole point is that it's not a black box but was created to have very powerful properties, and too many programmers new to UTF-8 fail to see this until they've worked with it a lot themselves.

Robert Harvey
R.. GitHub STOP HELPING ICE
  • @Philipp: Is writing more code to wrap a library to match your interface needs and work around its bugs any better? If you care to browse the existing library code that decodes UTF-8, you'll find that the vast majority is wrong in at least subtle ways, and at least 30% has serious security-critical bugs. (These estimates come from a Google code search I did a while back.) Also, the GNU implementation of `iconv` is orders of magnitude too slow for character-at-a-time conversions, though it works alright (albeit with intentional nonconformance) for bulk conversions. – R.. GitHub STOP HELPING ICE Jan 06 '11 at 16:08
  • my shot at a more advanced version: http://mercurial.intuxication.org/hg/cstuff/raw-file/tip/utf8_encode.c – Christoph Jan 06 '11 at 20:47
  • Rejecting non-characters may be useful for your application, but it's not part of the UTF-8 specification and in general incorrect. UTF's are one-to-one maps between sequences of code units (bytes or larger words) and "Unicode Scalar Values". The Unicode Scalar Values are exactly the integers 0-0xD7FF and 0xE000-0x10FFFF. This is all defined in the Unicode standard which you should read before trying to implement something of your own. – R.. GitHub STOP HELPING ICE Jan 06 '11 at 21:37
  • @R..: thanks for the info; the code is adapted from stuff I wrote some time ago, and which only ever operated on characters (ie excluded non-characters, surrogates as well as ascii control characters), so the details weren't as present as they should have been; however, I'm not convinced if it's worth to add another validation layer – Christoph Jan 07 '11 at 00:49
  • +1 for avoiding lib calls for such trivial stuff. People too often forget the cost of dynamic library calls (often it's a call+indirect jump or it's a far absolute call). If the call is for something heavy like `printf` no problem, it's negligible, but for a Unicode character conversion, it's huge. – Patrick Schlüter May 27 '11 at 16:40
  • @R..: Please explain what is `b` and what is `c`! Which variable represents the code point? To which value is `b` initialized? – user2284570 Aug 31 '15 at 17:07
  • @user2284570: `c` is the codepoint (input) and `b` is a pointer to the output buffer (bytes). – R.. GitHub STOP HELPING ICE Aug 31 '15 at 17:51
  • @R..: I guess `c` is int32 and `b` char*? Anyway, you should reflect this by editing your answer. In fact, I want to generate an HTML table listing Unicode values. Conversion with escaped codes slows down parsers and makes the HTML file larger, so using directly encoded UTF-8 is better. – user2284570 Aug 31 '15 at 18:12
  • @R..: Wait… Your code is wrong! UTF-8 is always big endian and this code isn't endian neutral. It would only work on big-endian machines whereas most of them are little-endian. – user2284570 Aug 31 '15 at 21:09
  • @user2284570: UTF-8 is a byte stream. It does not have endianness. There is no such thing as endianness unless you are inspecting or modifying the representation of types. – R.. GitHub STOP HELPING ICE Aug 31 '15 at 21:11
  • @R..: But for example, let's say `192+c/64` is equal to 11010000. Wouldn't a little endian machine write 00001011 in the output file? – user2284570 Aug 31 '15 at 21:17
  • @user2284570: No. A file is a sequence of bytes, not a sequence of bits. Endianness is byte order. This is a consequence of the fact that you address bytes, not bits. Some big endian CPU vendors number the bits of a byte backwards in their technical docs, but this is purely a notational quirk and has nothing to do with data interchange. On serial ports there is of course a bit order, but that's defined by the hardware, not CPU endianness. – R.. GitHub STOP HELPING ICE Aug 31 '15 at 21:19
  • The statement "UTF's are one-to-one maps between sequences of code units (bytes or larger words) and "Unicode Scalar Values". The Unicode Scalar Values are exactly the integers 0-0xD7FF and 0xE000-0x10FFFF." overlooks D92 on p. 124 of the current Unicode Standard: https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf D92 states: "Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points U+D800..U+DFFF is ill-formed." – Thomas Hedden Feb 13 '22 at 21:22
  • @ThomasHedden: Huh? I don't understand what point you're trying to make because you seem to have said exactly what I did. – R.. GitHub STOP HELPING ICE Feb 13 '22 at 23:30

iconv could be used I figure.

#include <iconv.h>

iconv_t cd;
char out[7];
wchar_t in = CODE_POINT_VALUE;
char *inbuf = (char *)&in, *outbuf = out;
size_t inlen = sizeof(in), outlen = sizeof(out);

cd = iconv_open("utf-8", "wchar_t");
iconv(cd, &inbuf, &inlen, &outbuf, &outlen);
iconv_close(cd);

But I fear that wchar_t might not represent Unicode code points, but arbitrary values. EDIT: I guess you can do it by simply using a Unicode source:

uint16_t in = UNICODE_POINT_VALUE;
cd = iconv_open("utf-8", "ucs-2");
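A self-contained variant of the same idea, sketched as a helper function (the name `cp_to_utf8_iconv` is mine; feeding "UCS-4LE" sidesteps the wchar_t and BMP questions, but this sketch assumes a little-endian host and a glibc-style iconv):

```c
#include <iconv.h>
#include <stdint.h>
#include <stddef.h>

/* Encode one code point to UTF-8 via iconv, passing it as a 32-bit
 * value in host (assumed little-endian) byte order.
 * Returns the number of bytes written to out, or 0 on failure. */
static size_t cp_to_utf8_iconv(uint32_t cp, char out[8]) {
    char *inbuf = (char *)&cp, *outbuf = out;
    size_t inlen = sizeof cp, outlen = 8;
    iconv_t cd = iconv_open("UTF-8", "UCS-4LE");
    if (cd == (iconv_t)-1) return 0;
    size_t r = iconv(cd, &inbuf, &inlen, &outbuf, &outlen);
    iconv_close(cd);
    return (r == (size_t)-1) ? 0 : (size_t)(outbuf - out);
}
```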
user562374
  • What if the code point is not in the BMP? ucs-2 can't represent it. One wchar_t may not be enough, depending on the platform. This is why I think the OP's assumption about knowing the code point is wrong. Because then comes the question of which encoding is used to represent it (UTF-32? UTF-16? obviously not UTF-8). – Serge Wautier Jan 05 '11 at 18:15
  • If `__STDC_ISO_10646__` is defined, `wchar_t` is a Unicode codepoint value. Note that if `wchar_t` is 16-bit, this implies that only the BMP is supported; UTF-16 is not a possibility. – R.. GitHub STOP HELPING ICE Jan 05 '11 at 22:56
  • A 16-bit `wchar_t` can definitely be used in UTF-16 encoded strings. All it means is that any codepoint value outside of the BMP will be encoded using two `wchar_t` surrogate characters side by side in the encoded string, that's all. The Windows API operates on exactly this kind of data, and it works just fine. – Remy Lebeau Jan 09 '11 at 09:31
  • @RemyLebeau: The C API for `wchar_t` conversion does not make such usage possible. There is no way for `mbrtowc` to generate a pair of `wchar_t` values as the result of its conversion. It can only generate one. I have no idea what Windows is doing, but it can't be providing a working version of these standard functions; it must be using some Windows-specific API instead and ignoring the fact that the standard functions don't work... – R.. GitHub STOP HELPING ICE Oct 04 '13 at 05:13
  • Many standard C API functions delegate to OS functions internally when appropriate. It does not make sense for compiler vendors to do everything manually. That includes text conversions. On Windows, text conversions are handled by the Win32 API `WideCharToMultiByte()` and `MultiByteToWideChar()` functions, both of which operate on UTF-16 encoded `wchar_t` data. All Unicode-enabled APIs on Windows are based on UTF-16, and have been for over a decade. – Remy Lebeau Oct 04 '13 at 15:14

A good part of the genius of UTF-8 is that converting from a Unicode Scalar value to a UTF-8-encoded sequence can be done almost entirely with bitwise, rather than integer arithmetic.

The accepted answer is very terse, but not particularly efficient or comprehensible as written. I replaced magic numbers with named constants, divisions with bit shifts, modulo with bit masking, and additions with bit-ors. I also wrote a doc comment pointing out that the caller is responsible for ensuring that the buffer is large enough.

#define SURROGATE_LOW_BITS 0x7FF
#define MAX_SURROGATE     0xDFFF
#define MAX_FOUR_BYTE   0x10FFFF
#define ONE_BYTE_BITS          7
#define TWO_BYTE_BITS         11
#define TWO_BYTE_PREFIX     0xC0
#define THREE_BYTE_BITS       16
#define THREE_BYTE_PREFIX   0xE0
#define FOUR_BYTE_PREFIX    0xF0
#define CONTINUATION_BYTE   0x80
#define CONTINUATION_MASK   0x3F

/**
 * Ensure that buffer has space for AT LEAST 4 bytes before calling this function,
 *   or a buffer overrun will occur.
 * Returns the number of bytes written to buffer (0-4).
 * If scalar is a surrogate value, or is out of range for a Unicode scalar,
 *   writes nothing and returns 0.
 * Surrogate values are integers from 0xD800 to 0xDFFF, inclusive.
 * Valid Unicode scalar values are non-surrogate integers between
 *   0 and 1_114_111 decimal (0x10_FFFF hex), inclusive.
 */
int encode_utf_8(unsigned long scalar, char* buffer) {
  if ((scalar | SURROGATE_LOW_BITS) == MAX_SURROGATE || scalar > MAX_FOUR_BYTE) {
    return 0;
  }

  int bytes_written = 0;

  if ((scalar >> ONE_BYTE_BITS) == 0) {
    *buffer++ = scalar;
    bytes_written = 1;
  }
  else if ((scalar >> TWO_BYTE_BITS) == 0) {
    *buffer++ = TWO_BYTE_PREFIX | (scalar >> 6);
    bytes_written = 2;
  }
  else if ((scalar >> THREE_BYTE_BITS) == 0) {
    *buffer++ = THREE_BYTE_PREFIX | (scalar >> 12);
    bytes_written = 3;
  }
  else {
    *buffer++ = FOUR_BYTE_PREFIX | (scalar >> 18);
    bytes_written = 4;
  }
  // Intentionally falling through each case
  switch (bytes_written) {
    case 4: *buffer++ = CONTINUATION_BYTE | ((scalar >> 12) & CONTINUATION_MASK);
    case 3: *buffer++ = CONTINUATION_BYTE | ((scalar >>  6) & CONTINUATION_MASK);
    case 2: *buffer++ = CONTINUATION_BYTE |  (scalar        & CONTINUATION_MASK);
    default: return bytes_written;
  }
}
Clement Cherlin

libiconv.

Conrad Meyer

Which platform? On Windows, you can use WideCharToMultiByte(CP_UTF8,...)

Arguably, the source codepoint must be encoded in UTF-16, which means you must be able to do such encoding. In some cases (surrogate pairs), it's not trivial.

My understanding is that you have some text in a given codepage and you want to convert it to Unicode (UTF-16). Right? A MultiByteToWideChar(codePage, sourceText,...) / WideCharToMultiByte(CP_UTF8, utf16Text,...) roundtrip will do the trick.
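The surrogate-pair step mentioned above is mechanical once spelled out; here is a portable sketch (plain C, not the Windows API; the function name is illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16 code units (host order).
 * Returns 1 or 2 units written, or 0 for a surrogate or
 * out-of-range input. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp - 0xD800 < 0x800 || cp > 0x10FFFF) return 0;
    if (cp < 0x10000) { out[0] = (uint16_t)cp; return 1; }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```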

Serge Wautier

I agree with Clement that the accepted answer does not explain things very well. The following document explains things in a very simple way:

Yergeau, F. 2003. UTF-8, a transformation format of ISO 10646. RFC 3629, section 3, pp. 3-4.

The following book provides a good general explanation of UTF-8 on page 298:

Korpela, Jukka K. 2006. Unicode Explained. Sebastopol, etc.: O'Reilly Media, Inc.

Thomas Hedden