
I am trying to write a cross-platform application with Unicode support. I am using the UTF8-CPP library ( http://utfcpp.sourceforge.net/ ), but I am having trouble iterating through a string:

string s1 = "Добрый день";
utf8::iterator<string::iterator> iter(s1.begin(), s1.begin(), s1.end());

for(int i = 0; i < utf8::distance(s1.begin(), s1.end()); i++, ++iter)
{
    cout << (*iter);
}

The above code, when redirected to a UTF-8 formatted text file, produces the following output:

6 3 6 3 6 3 6 3 6 3 6 3 3 2 6 3 6 3 6 3 6 3 

How can I get the content of s1 to appear in the file properly?

Qman
  • Where are the spaces in the output coming from? What encoding is your source file in? Which version of the library are you using? – ecatmur Aug 23 '12 at 16:16
  • The spaces are a mystery to me as well. The encoding of the source file is UTF-8. I am using v2_3_2 of the library. – Qman Aug 23 '12 at 16:22
  • Whatever compiler you're using I think it doesn't like Unicode string literals. Try escape codes maybe. Or maybe the file is automatically ASCII... verify that's not the case. – user541686 Aug 23 '12 at 16:27
  • @Mehrdad I am compiling my code in Visual Studio 2010. How do I get the compiler to support Unicode? (And what are escape codes?) (The file is not ASCII. I changed its encoding to UTF-8, deleted its content, and only then piped the output of the program into it). – Qman Aug 23 '12 at 16:28
  • I think there is a compiler option if I remember right... Check on MSDN. (I'm on my phone so sorry it's terse.) – user541686 Aug 23 '12 at 16:30
  • As per part of my answer (I did get one part right), you need to use std::wstring and prefix your unicode string literal with 'L'. `std::wstring s1 = L"Добрый день";` – Gareth Wilson Aug 23 '12 at 16:31
  • `utf8::iterator` yields `uint32_t`, which is an integer type. Your program should be printing a string of numbers all smashed together. Have you done something weird to `cout`? – ecatmur Aug 23 '12 at 16:32
  • @GarethWilson not at all; if the source file is UTF-8 then a normal string literal will contain the correct byte sequence. `wstring` is the wrong width to store UTF-8; it needs to be (8-bit) `char`. C++11 has UTF-8 string literals `u8"Добрый день"` which guarantee UTF-8 storage. – ecatmur Aug 23 '12 at 16:35
  • @ecatmur I believe this is VS2010, so no u8 support, and was getting '?' for all the characters (other than space) which is why I suggested wstring which does at least store the string correctly, on a cursory examination (VS debug tool tips). – Gareth Wilson Aug 23 '12 at 16:37
  • @ecatmur I have not done anything weird to `cout`. My compiler does not support `u8`. How do I get it to support it? – Qman Aug 23 '12 at 16:37
  • So you want to output UTF8 characters to a text file? – Gareth Wilson Aug 23 '12 at 16:42
  • Yes. I want to be able to take a UTF8 string, go through each of its characters, and write it to a file. – Qman Aug 23 '12 at 16:44
  • In VC++2010 you can use C++11 Unicode features, and the std::codecvt_utf8_utf16 facet in particular (it encapsulates conversion between a UTF-8 encoded byte string and a UTF-16 encoded character string). See the reference - http://en.cppreference.com/w/cpp/locale/codecvt_utf8_utf16. Note that VC++ 2010 does not implement the "u8" string prefix. – SChepurin Aug 23 '12 at 17:03
  • You really do not want to use UTF-8 internally. It is a fantastic storage and transport format, but using it in code is a real pain. Transform it into UTF-32 (Unix) or UTF-16 (really UCS-2) (Win) and use the fixed-size qualities. – Martin York Aug 23 '12 at 17:14
  • By the way, see how to write in file here - http://stackoverflow.com/questions/11646368/how-to-set-file-encoding-format-to-utf8-in-c/11647084#11647084. I guess you could figure out how to read UTF-8 file, handle it in UTF-16 and output UTF-8 in file. – SChepurin Aug 23 '12 at 17:16
  • @LokiAstari, and all other commenters - please read: http://www.utf8everywhere.org. Yes, you do want to use utf8 in memory and forget about any encoding conversions for life... – Pavel Radzivilovsky Aug 23 '12 at 20:27
  • @PavelRadzivilovsky: About the only thing I agree with is that UTF-16 is the worst of all worlds. Having had extensive experience in the field "just like you", a fixed-width format is really the only way to go when doing string manipulation. Whether multiple code points make up a single glyph is really irrelevant in most processing fields of work and only of relevance for display, thus negating most of the other arguments, which seem more of a rant on Windows than an organized discussion. – Martin York Aug 23 '12 at 21:29
  • In my experience it's usually irrelevant whether multiple code units make up a single code point as well. Either I can work in terms of code units (when normalization isn't an issue; copying, concatenation, simple searches, etc.) or I need to know that multiple code points must be treated as a single entity (and not just for display; for cursor movement, regexes, splitting, etc.). Fixed-width encodings bring no value. Unicode characters are fundamentally variable length. – bames53 Aug 24 '12 at 18:11
  • "fixed width format is really the only way to go when doing string manipulation" - not really. See FAQ 18 of utf8everywhere. You will see that fixed width is actually not so good, and variable is not so bad. – Pavel Radzivilovsky Aug 24 '12 at 21:23

2 Answers


You need to ensure that the string is being initialized with the correct data, and then that the iterator is producing the correct values.

You're using VS2010, so there's a bit of a problem with string literals. C++ implementations have an 'execution character set' to which they convert character and string literals from the 'source character set'. Visual Studio does not support UTF-8 as an execution character set, and therefore will not intentionally produce a UTF-8 encoded string literal.

You can get one by tricking the compiler, or by using hex escapes. Also instead of getting a UTF-8 string literal you could just get a wide string containing the correct data and then convert it at runtime to UTF-8.


Edit: More recent versions of Visual Studio do have ways to get UTF-8 string literals: Visual Studio 2015 supports C++11's UTF-8 string literals, and as of Visual Studio 2015 Update 2 you can also use the compiler flags /execution-charset:utf-8 or /utf-8.


Tricking the compiler

If you save the source code as 'UTF-8 without signature' then the compiler will think that the source encoding is the system locale encoding. VS always uses the system locale encoding as the execution encoding. So when it thinks the source and execution encodings are the same it will not perform any conversion and your source bytes, which will actually be UTF-8, will be used directly for the string literal thus producing a UTF-8 encoded string literal. (note that this breaks the conversion done for wide character and string literals.)
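
If you go this route, it is worth verifying what the compiler actually stored. Here is a minimal sketch (nothing beyond the standard library is assumed) that dumps the literal's raw bytes, so you can tell UTF-8 (d0 94 d0 be ...) apart from a lossy locale conversion (3f 3f ..., i.e. '?'):

    #include <iomanip>
    #include <iostream>
    #include <string>

    int main()
    {
        std::string s1 = "Добрый день"; // whatever bytes the compiler actually stored
        // Dump the raw bytes: UTF-8 gives d0 94 d0 be d0 b1 ...,
        // while a lossy conversion to the locale encoding gives 3f 3f 3f ... (all '?').
        for (std::string::size_type i = 0; i < s1.size(); ++i)
            std::cout << std::hex << std::setw(2) << std::setfill('0')
                      << static_cast<unsigned int>(static_cast<unsigned char>(s1[i])) << ' ';
        std::cout << '\n';
        return 0;
    }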

Hex escapes

Hex escape codes let you manually insert code units (bytes in this case) of any value into a string literal. You can manually determine the UTF-8 encoding you want and then insert those values into the string literal.

std::string s1 = "\xd0\x94\xd0\xbe\xd0\xb1\xd1\x80\xd1\x8b\xd0\xb9 \xd0\xb4\xd0\xb5\xd0\xbd\xd1\x8c";

UTF-8 string literal prefix

C++11 specifies a prefix that creates a UTF-8 string literal regardless of the execution encoding; however, Visual Studio does not implement this yet. It looks like:

string s1 = u8"Добрый день";

It requires that the compiler know and use the correct source encoding (and therefore that the source encoding support the desired string). The compiler then does the conversion from the source encoding to UTF-8 instead of to the execution encoding. When Visual Studio supports this feature you'll probably want to save your source code as 'UTF-8 with signature.' (Again, VS depends on the signature to identify UTF-8 source.)


After you have a UTF-8 string, then assuming the UTF-8 iterator works, your example code should produce the correct 11 code points, and I think the output text should look like:

104410861073108810991081321076107710851100

Insert some spaces to make it readable and you can verify that you're getting the right values:

1044 1086 1073 1088 1099 1081 32 1076 1077 1085 1100

Or make it hex and add the Unicode prefix:

U+0414 U+043e U+0431 U+0440 U+044b U+0439 U+0020 U+0434 U+0435 U+043d U+044c
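
A minimal sketch of producing that last form with the same utf8::iterator the question uses (this assumes utfcpp's "utf8.h" header; the hex-escaped literal from above is used so the bytes are UTF-8 regardless of the compiler's execution charset):

    #include <iomanip>
    #include <iostream>
    #include <string>
    #include "utf8.h"

    int main()
    {
        std::string s1 = "\xd0\x94\xd0\xbe\xd0\xb1\xd1\x80\xd1\x8b\xd0\xb9 \xd0\xb4\xd0\xb5\xd0\xbd\xd1\x8c";
        utf8::iterator<std::string::iterator> it(s1.begin(), s1.begin(), s1.end());
        utf8::iterator<std::string::iterator> end_it(s1.end(), s1.begin(), s1.end());
        for (; it != end_it; ++it)   // *it is a uint32_t code point
            std::cout << "U+" << std::hex << std::setw(4) << std::setfill('0')
                      << *it << ' ';
        std::cout << '\n';
        return 0;
    }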

If you actually want to produce a UTF-8 encoded output file then you shouldn't be using the UTF-8 iterator anyway.

string s1 = "Добрый день";
std::cout << s1;

When the output is redirected to a file then the file will contain the UTF-8 encoded data:

Добрый день
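
If you would rather write to the file directly instead of redirecting cout, here is a minimal sketch (the file name out.txt is arbitrary; again the hex-escaped literal guarantees UTF-8 content):

    #include <fstream>
    #include <string>

    int main()
    {
        std::string s1 = "\xd0\x94\xd0\xbe\xd0\xb1\xd1\x80\xd1\x8b\xd0\xb9 \xd0\xb4\xd0\xb5\xd0\xbd\xd1\x8c";
        std::ofstream out("out.txt", std::ios::binary); // binary: write the bytes untouched
        out << s1; // the file now contains the UTF-8 encoding of "Добрый день"
        return 0;
    }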

I don't understand why your actual output currently contains a bunch of extra spaces, but it looks like the actual numbers that are being accessed are:

63 63 63 63 63 63 32 63 63 63 63

63 is the ASCII code for '?' and 32 is the ASCII code for a space; ?????? ????. So you are clearly suffering from VC++'s conversion of the string literal to the system locale encoding.

bames53

Answer updated. Use wstring (probably the best option given VS2010) to store a UTF-16 string, convert it to UTF-8, and output that.

This works for me when I view in a UTF8 compatible editor (Scite).

    // Needs: #include <iostream>, <string>, <vector>, <iterator>, and utfcpp's "utf8.h"
    std::wstring s1 = L"Добрый день";   // UTF-16 on Windows, where wchar_t is 16 bits
    std::vector<unsigned char> UTF8;

    // Convert the UTF-16 wide string to a sequence of UTF-8 bytes.
    utf8::utf16to8( s1.begin(), s1.end(), std::back_inserter( UTF8 ) );

    for( auto It = UTF8.begin() ; It != UTF8.end() ; ++It )
    {
        std::cout << (*It);
    }

I don't think there's a way in VS2010 to have a UTF8 literal or string object. UTF16 (wstring) is, I think, your best bet internally; then use the UTF8 library to convert to/from UTF8 when exporting to files, the network, etc.
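
For the import direction, a minimal sketch along the same lines (the helper name load_utf8_file is made up; utf8::utf8to16 is utfcpp's counterpart of utf16to8, and the wchar_t target assumes Windows' 16-bit wide characters):

    #include <fstream>
    #include <iterator>
    #include <string>
    #include "utf8.h"

    // Read a whole UTF-8 encoded file and convert its contents to a UTF-16 wstring.
    std::wstring load_utf8_file( const char* path )
    {
        std::ifstream in( path, std::ios::binary );
        std::string bytes( (std::istreambuf_iterator<char>( in )),
                            std::istreambuf_iterator<char>() );

        std::wstring utf16;
        utf8::utf8to16( bytes.begin(), bytes.end(), std::back_inserter( utf16 ) );
        return utf16;
    }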

Gareth Wilson
  • When I changed the program as you instructed I got the following output: `2 0 6 2 4 9 6 4 7 5 5 7 3 2 5 2 5 3 6 1 7 6 `. Additionally, I got a warning: `warning C4244: 'argument' : conversion from 'wchar_t' to 'utf8::uint8_t', possible loss of data`. – Qman Aug 23 '12 at 16:19
  • A wide string is _never_ stored in UTF8, so iterating over it with a UTF8 iterator doesn't make much sense. – Mooing Duck Aug 23 '12 at 16:22
  • Sorry, maybe I misunderstood the question - what are you expecting it to output? The actual string? @MooingDuck Yeah, I latched onto the wrong problem - in that they were trying to store a wide string into a normal string and getting '?'. – Gareth Wilson Aug 23 '12 at 16:23
  • Yes I am trying to print the actual string in the file using the iterator (just as a test to make sure I can deal with individual characters in a utf8 string). – Qman Aug 23 '12 at 16:25
  • Then you don't want to use UTF8 iterators; those get you the code points of the characters, not something usually displayable. The UTF8 library is more for converting from UTF8 into normal or wide strings, which can then be displayed as normal. – Gareth Wilson Aug 23 '12 at 16:29
  • @Griwes Really? I've not used it much myself. – Gareth Wilson Aug 23 '12 at 16:35
  • @GarethWilson, or rather spawn of M$. While every sane environment uses UTF-8, they started to use UTF-16, making such ridiculous classes needed. – Griwes Aug 23 '12 at 16:38
  • @Griwes multi-byte character sets predate UTF-16. – ecatmur Aug 23 '12 at 16:50
  • @GarethWilson This is the output of your updated solution: `h%� h%[%h%�%d%� d%� h%c% h%$%h%a%h%\%d%� ` – Qman Aug 23 '12 at 16:50
  • @GarethWilson UTF-16 wide strings are only "normal" on Windows. Unix and most of the Internet uses UTF-8. – ecatmur Aug 23 '12 at 16:51
  • @ecatmur Hmm, true. I'm not sure what to suggest for a fully portable solution. UTF8 to UTF16 for Windows, and keep in UTF8 for everything else. Could be wrapped in a suitable class with platform #ifdefs I guess. Qman That's odd, I get the original string back when viewed in a UTF8 editor. – Gareth Wilson Aug 23 '12 at 16:53
  • @ecatmur, of course they do, but the only real need for them arose from windows. – Griwes Aug 23 '12 at 16:53
  • @GarethWilson, fully portable solution is to use proper Unicode library, like ICU or Qt, to keep everything simple. – Griwes Aug 23 '12 at 16:54
  • @Griwes I guess put that in an answer then as something for QMan to look into. I've tried my best to solve the 'problem' present by QMan, but his bigger goal is a cross-platform solution, so those libraries may be his best bet. – Gareth Wilson Aug 23 '12 at 16:56
  • @Griwes From my understanding, ICU is very heavy. All I need is a library that can handle some string manipulations like taking a substring. Does such a thing not exist? – Qman Aug 23 '12 at 16:57
  • I think they need to be somewhat 'heavy' given the nature of UTF-8 and UTF-16; a 'char' in UTF-8/16 doesn't always equal an actual Unicode character because of the encoding and various combining characters, so something seemingly 'simple' like substring isn't quite so simple. – Gareth Wilson Aug 23 '12 at 16:59
  • By heavy I meant that it had many additional libraries I was not planning to use. Is there a way to extract the string manipulation libraries and their dependencies in a simple and clean way? – Qman Aug 23 '12 at 17:01
  • @QMan I'm not familiar with the libraries mentioned, so will sit the rest of this question out. I've tried to answer part of your question, at least on Windows, and it works for me - not sure why you're not having any luck. Hopefully someone else can step up with some more answers for you on the cross-platform side of things. – Gareth Wilson Aug 23 '12 at 17:20
  • @PavelRadzivilovsky I know, right! I do a lot of .Net stuff these days, and it's UTF-16 all the way! I thought it'd be the simplest way to go, at least on Windows given all the APIs use it, but it seems nearly everyone else disagrees - so a learning experience! – Gareth Wilson Aug 24 '12 at 12:26
  • well, my POV is to have UTF-16 where necessary, tightly wrapped around calling these APIs. In practice, despite that there are many diverse APIs, they don't seem to cover much of the execution time in life, unlike disk and communications which are few APIs but used a lot. – Pavel Radzivilovsky Aug 24 '12 at 21:20