Read UTF-8 file into UCS-4 string

Question

I am trying to read a UTF-8 encoded file into a UTF-32 (UCS-4) string. Basically internally I want a fixed size character internally to the application.

Here I want to make sure the translation is done as part of the stream processing (because that is what the locale is supposed to be used for). Alternative questions have been posted about doing the translation on the string, but this is wasteful: you have to do a translation phase in memory and then a second pass to send the result to the stream. By doing it with the locale in the stream you only need a single pass, and there is no requirement for a copy to be made (assuming you want to maintain the original).

This is what I tried.

#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>

int main()
{
    std::locale     converter(std::locale(), new std::codecvt_utf8<char32_t>);
    std::basic_ifstream<char32_t>   iFile;
    iFile.imbue(converter);
    iFile.open("test.data");

    std::u32string     line;
    while(std::getline(iFile, line))
    {
    }
}

Since these are all standard types, I was surprised by this compilation error:

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/istream:275:41:
error: no matching function for call to 'use_facet'

            const ctype<_CharT>& __ct = use_facet<ctype<_CharT> >(__is.getloc());
                                        ^~~~~~~~~~~~~~~~~~~~~~~~~

Compiled with:

g++ -std=c++14 test.cpp

Possible duplicate of [C++ & Boost: encode/decode UTF-8](http://stackoverflow.com/questions/6140223/c-boost-encode-decode-utf-8) — nwellnhof, Jan 28 '16 at 15:25
@nwellnhof: This is definitely not a duplicate of the linked question. That question is about translating in memory in a string. I want to know the correct way of doing when passing it to the stream. — Martin York, Jan 28 '16 at 18:48
@LokiAstari: "*Basically internally I want a fixed size character internally to the application.*" Whatever advantage you believe that this provides you is not valid. At least, not as far as Unicode compliance is concerned. — Nicol Bolas, Jan 28 '16 at 18:50
@NicolBolas: Why not. UTF-32 is fixed size an UNICODE compliant. — Martin York, Jan 28 '16 at 18:52
@LokiAstari: "*I want to know the correct way of doing when passing it to the stream.*" It would be *immensely* faster for you to load the UTF-8 as is and convert it yourself than to use locale and `codecvt` facets. So while it's not *technically* a duplicate, any answer provided here will be less useful than answers provided there. — Nicol Bolas, Jan 28 '16 at 18:53
@NicolBolas: Why do you think it would be faster? I have not timed it yet. But I have a feeling it will be faster this way and use less memory (but testing will verify and I'll post my results). — Martin York, Jan 28 '16 at 18:54
@LokiAstari: "*Why not. UTF-32 is fixed size an UNICODE compliant.*" Note what I quoted of yours: "a fixed size character". Unicode does not allow such a thing. Thanks to combining characters and so forth, a single Unicode codepoint *does not* represent a visible grapheme cluster (character). And if you write code that assumes that a codepoint is a character, your code will be broken according to Unicode. — Nicol Bolas, Jan 28 '16 at 18:54
@NicolBolas: Yes, I had forgotten about that (the combining characters). That does apply to the `std::reverse()` that I use in the example answer. So that is a bad example. But not actually the point of the post. — Martin York, Jan 28 '16 at 19:03
@NicolBolas: Added timing results. It is definitely quicker to do it this way. **BUT** not a significant enough difference to make it the deciding factor. But in my opinion this is the superior technique. — Martin York, Jan 28 '16 at 19:41

Martin York · Accepted Answer · 2016-01-29T01:36:29.737

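It seems char32_t is not what I wanted: the standard library only guarantees std::ctype specializations for char and wchar_t, and the input stream needs that facet, which is why use_facet failed. Simply moving to wchar_t worked for me. I suspect that this only works the way I want on Linux-like systems and that on Windows, where wchar_t is 16 bits, this conversion will be to UTF-16 (UCS-2) (but I can't test that).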
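
The suspicion about Windows can at least be checked at compile time. A minimal sketch, assuming the only question is whether wchar_t is wide enough to hold every Unicode code point (the helper name wchar_holds_any_code_point is made up for illustration):

```cpp
#include <limits>

// Hypothetical helper: true when wchar_t can represent U+10FFFF, the largest
// Unicode code point. Expected to hold where wchar_t is 32 bits (typical on
// Linux and macOS) and to fail where it is 16 bits (Windows).
constexpr bool wchar_holds_any_code_point()
{
    return static_cast<unsigned long>(std::numeric_limits<wchar_t>::max()) >= 0x10FFFFul;
}
```

Adding static_assert(wchar_holds_any_code_point(), "wchar_t is too narrow for UTF-32"); to the program would make a build on a platform with a narrower wchar_t fail instead of quietly behaving differently.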

#include <algorithm>
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    std::locale           utf8_to_utf32(std::locale(), new std::codecvt_utf8<wchar_t>);

    // Input stream reads UTF-8 and converts to UTF-32 (UCS-4) String
    std::wifstream        iFile("test.data");
    iFile.imbue(utf8_to_utf32);

    // Output UTF-32 (UCS-4) string converts to UTF-8 stream
    std::wofstream        oFile("test.res");
    oFile.imbue(utf8_to_utf32);


    // Now just read like you would normally.
    std::wstring     line;
    while(std::getline(iFile, line))
    {
        // UTF-32 characters are fixed size.
        // So reverse is simple just do it in-place.
        std::reverse(std::begin(line), std::end(line));

        // UTF-32 unfortunately also has grapheme clusters (these are groups of code points
        // that are displayed as a single glyph). By doing the reverse above we have split
        // these incorrectly. We need to do a second pass to reverse the code points inside
        // each cluster. This is beyond the scope of this question and left as an exercise
        // (but I may come back to it later).
        oFile << line << "\n";
    }
}
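
The second pass mentioned in the comments can be sketched for one narrow case. This assumes that only code points in U+0300 to U+036F (the Combining Diacritical Marks block) count as combining; full grapheme cluster segmentation is defined by Unicode Standard Annex #29 and needs much more than this. The names is_combining_mark and reverse_keeping_marks are made up for illustration.

```cpp
#include <algorithm>
#include <string>

// Assumption for this sketch: only U+0300..U+036F are treated as combining marks.
bool is_combining_mark(wchar_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Reverse the code points, then restore "base, marks..." order inside each cluster.
void reverse_keeping_marks(std::wstring& s)
{
    std::reverse(s.begin(), s.end());
    for (auto p = s.begin(); p != s.end(); ) {
        // After the reversal, a cluster's marks come before its base character.
        auto q = std::find_if_not(p, s.end(), is_combining_mark);
        if (q != s.end()) ++q;   // include the base character in the cluster
        std::reverse(p, q);      // put the base back in front of its marks
        p = q;
    }
}
```

In the loop above, reverse_keeping_marks(line) would take the place of the plain std::reverse call.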

A comment above suggested this would be slower than reading the data and then translating it in memory. So I did some tests:

// read1.cpp Translation in stream using codecvt and Locale

#include <algorithm>   // std::reverse
#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>
#include <string>


int main()
{
    std::locale           utf8_to_utf32(std::locale(), new std::codecvt_utf8<wchar_t>);

    std::wifstream        iFile("test.data");
    iFile.imbue(utf8_to_utf32);

    std::wofstream        oFile("test.res1");
    oFile.imbue(utf8_to_utf32);

    std::wstring     line;
    while(std::getline(iFile, line))
    {
        std::reverse(std::begin(line), std::end(line));
        oFile << line << "\n";
    }
}

// read2.cpp Translation using codecvt after reading.

#include <algorithm>   // std::reverse
#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>
#include <string>

int main()
{
    std::ifstream        iFile("test.data");
    std::ofstream        oFile("test.res2");

    std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_to_utf32;

    std::string     line;
    std::wstring    wideline;
    while(std::getline(iFile, line))
    {
        wideline = utf8_to_utf32.from_bytes(line);
        std::reverse(std::begin(wideline), std::end(wideline));
        oFile << utf8_to_utf32.to_bytes(wideline) << "\n";
    }
}

// read3.cpp Using UTF-8

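#include <algorithm>
#include <cstdint>     // std::uint8_t
#include <iostream>
#include <string>
#include <fstream>

// True for any byte that starts a UTF-8 sequence (ASCII or a multi-byte lead byte).
static bool is_lead(std::uint8_t ch) { return ch < 0x80 || ch >= 0xc0; }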

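/* Reverse a utf-8 string in-place */
void reverse_utf8(std::string& s) {
  // Reverse all bytes, then re-reverse the bytes of each character back into order.
  std::reverse(s.begin(), s.end());
  for (auto p = s.begin(), end = s.end(); p != end; ) {
    auto q = p;
    p = std::find_if(p, end, is_lead);
    if (p != end) ++p;  // guard: malformed input may have no lead byte left
    std::reverse(q, p);
  }
}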

int main(int argc, char** argv)
{
    std::ifstream        iFile("test.data");
    std::ofstream        oFile("test.res3");

    std::string     line;
    while(std::getline(iFile, line))
    {
        reverse_utf8(line);
        oFile << line << "\n";
    }
    return 0;
}

The test file was 58 MB of Japanese Unicode text:

> ls -lah test.data
-rw-r--r--  1 loki  staff    58M Jan 28 11:28 test.data

> g++ -O3 -std=c++14 read1.cpp -o a1
> g++ -O3 -std=c++14 read2.cpp -o a2
> g++ -O3 -std=c++14 read3.cpp -o a3
>
> # This is the one using Locale in stream
> time ./a1

real    0m0.645s
user    0m0.521s
sys     0m0.108s
>
> # This is the one doing translation after reading.
> time ./a2

real    0m1.058s
user    0m0.916s
sys     0m0.123s
>
> # This is the one using UTF-8
> time ./a3

real    0m0.785s
user    0m0.663s
sys     0m0.104s

Doing the translation in the stream is faster, but not significantly so (note that it was a lot of data). So choose the one that is easiest to read.
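
If a std::u32string is still the goal, as in the question, one option that was not timed above is to read plain bytes and decode them by hand. The sketch below is a hypothetical helper, decode_utf8, not code from the benchmark. It replaces malformed or truncated sequences with U+FFFD and, as a simplification, does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Simplified: overlong forms and surrogate code points are not rejected.
std::u32string decode_utf8(const std::string& bytes)
{
    const char32_t replacement = 0xFFFD;
    std::u32string out;
    out.reserve(bytes.size());
    for (std::size_t i = 0; i < bytes.size(); ) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;   // number of continuation bytes expected
        char32_t cp = 0;
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(replacement); ++i; continue; }  // stray or invalid byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= bytes.size()) { ok = false; break; }     // truncated sequence
            const unsigned char c = static_cast<unsigned char>(bytes[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }    // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : replacement);
    }
    return out;
}
```

Inside a read loop it would be used as std::u32string wide = decode_utf8(line); where line comes from std::getline on a plain std::ifstream.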

fwiw, doing the UTF-8 reversal in-place without conversions works out to be about 30% faster (measured on a 7.5MB Japanese corpus taken from Project Gutenberg, which I copied 16 times to make it big enough to measure). The guts of the code are here: http://coliru.stacked-crooked.com/a/c543ea86c86bb117 — rici, Jan 28 '16 at 22:20
@rici: Please show the actual code you used (so I can do a comparison). The code you linked does not do what your comment says. I would also love a link to the corpus you used. I got the `ipsum lorem japanese` equivalent from http://generator.lorem-ipsum.info/_japanese and then replicated it a lot of times to get the size I needed. — Martin York, Jan 29 '16 at 01:15
That is the actual function. Why do you say it doesn't do what it says? The output is demonstrably reversed, no? Or am I missing something? — rici, Jan 29 '16 at 01:23
@rici: because it reads command line arguments. So you would have a hard time putting 7.5MB on the command line. — Martin York, Jan 29 '16 at 01:26
Yes, but I just call that function repeatedly in a loop. `while (getline(in, line)) { reverse_utf8(line); out << line << '\n'; }`. (`in` and `out` are standard `ifstream` and `ofstream` objects, no use of locale.) — rici, Jan 29 '16 at 01:27
@rici: Yes you can. Is that what you did to read 7.5MB to get your timing of 30% faster? — Martin York, Jan 29 '16 at 01:28
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/101958/discussion-between-rici-and-loki-astari). — rici, Jan 29 '16 at 01:29
@rici: Added your code above. Did the same test. It's slightly slower than conversion in streams but slightly faster than doing the translation in place. — Martin York, Jan 29 '16 at 01:37
OK, I'm giving up on this benchmark. I've established that using Clang there is a difference of between 400 and 500% between my function using GNU's libstdc++ and Clang's libc++. The version compiled with g++ using libstdc++ falls roughly in the middle. So three compilation environments, three radically different timings. I have no idea what causes the difference; I tried replacing std::reverse and std::find_if but the difference persists. — rici, Jan 29 '16 at 03:02
