I don't know how to solve this. Imagine we have four websites:
- A: UTF-8
- B: ISO-8859-1
- C: ASCII
- D: UTF-16
My program, written in C++, downloads a website and parses it, but it has to understand the content. My problem is not the parsing itself, which is done with ASCII characters like `>` or `<`.
The problem is that the program should extract all words from the website's text. A word is any sequence of alphanumeric characters. I then send these words to a server. The database and the web frontend use UTF-8. So my questions are:
- How can I convert "any" (or at least the most common) character encodings to UTF-8?
- How can I work with UTF-8 strings in C++? I think `wchar_t` does not work because it is only 2 bytes long on some platforms (e.g. Windows), while a code point encoded in UTF-8 takes up to 4 bytes.
- Are there functions like `isspace()`, `isalnum()`, `strlen()` and `tolower()` for such UTF-8 strings?
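For the conversion question, one common portable approach is the POSIX `iconv` API (on Windows the same interface is available through the libiconv library). Below is a minimal sketch, not a hardened implementation; `to_utf8` is a name chosen here, and error handling and stateful-encoding flushing are simplified:

```cpp
#include <iconv.h>
#include <cerrno>
#include <stdexcept>
#include <string>

// Convert `input` from the encoding named `from` (e.g. "ISO-8859-1")
// to UTF-8 using POSIX iconv. Sketch only: a production version would
// also flush shift state for stateful encodings.
std::string to_utf8(const std::string& input, const char* from) {
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("unsupported encoding");

    std::string out;
    char buf[1024];
    char* inptr = const_cast<char*>(input.data());
    size_t inleft = input.size();
    while (inleft > 0) {
        char* outptr = buf;
        size_t outleft = sizeof(buf);
        // E2BIG just means the output buffer is full; loop and continue.
        if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t)-1
            && errno != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("conversion failed");
        }
        out.append(buf, sizeof(buf) - outleft);
    }
    iconv_close(cd);
    return out;
}
```

For example, `to_utf8("Gr\xFC\xDF", "ISO-8859-1")` would yield the UTF-8 bytes for "Grüß". The encoding name itself you would still have to detect, e.g. from the HTTP `Content-Type` header or the HTML `<meta charset>` tag.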
Please note: I do not do any output (like `std::cout`) in C++; I just filter out the words and send them to the server.
I know about UTF8-CPP, but it has no `is*()` functions. And as I read, it does not convert from other character encodings to UTF-8, only from UTF-* to UTF-8.
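Since libraries like UTF8-CPP leave classification to you, one option is to iterate over the code points of a UTF-8 string yourself and classify each one. The sketch below (`decode_utf8` is a name chosen here) assumes well-formed input and skips validation of continuation bytes. Once you have code points, ICU's `u_isalnum()`/`u_isspace()` from `<unicode/uchar.h>` give Unicode-aware classification; the standard `isalnum()` only covers the ASCII subset:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Decode a UTF-8 string into Unicode code points.
// Sketch only: assumes well-formed UTF-8 and does not
// validate continuation bytes or reject overlong forms.
std::vector<uint32_t> decode_utf8(const std::string& s) {
    std::vector<uint32_t> cps;
    for (size_t i = 0; i < s.size(); ) {
        unsigned char b = s[i];
        uint32_t cp;
        int len;
        if      (b < 0x80) { cp = b;        len = 1; }  // 1-byte (ASCII)
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }  // 2-byte lead
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }  // 3-byte lead
        else               { cp = b & 0x07; len = 4; }  // 4-byte lead
        // Fold in the low 6 bits of each continuation byte.
        for (int k = 1; k < len; ++k)
            cp = (cp << 6) | (s[i + k] & 0x3F);
        cps.push_back(cp);
        i += len;
    }
    return cps;
}
```

Word extraction then becomes grouping runs of consecutive alphanumeric code points, with the original UTF-8 bytes of each run sent to the server unchanged.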
Edit: I forgot to say that the program has to be portable: Windows, Linux, ...