4

It's been a horrible experience for me to get an understanding of Unicode, locales, wide characters and conversion.

I need to read a text file which contains Russian, English, Chinese and Ukrainian characters all at once.

My approach is to read the file in byte-chunks, then operate on each chunk on a separate thread, for fast reading. (Link)

This is done using std::ifstream.read(myChunkBuffer, chunk_byteSize)

However, I understand that there is no way any character from my multilingual file can be represented via the 256 combinations a char offers, if I stick to char.


For that matter I converted everything into wchar_t and hoped for the best.

I also know about Sys.setlocale(locale = "Russian") (Link), but doesn't that then interpret every character as Russian? I wouldn't know when to flip between my 4 languages as I am parsing my bytes.

On Windows OS, I can create a .txt file and write "Привет! Hello!" in the program Notepad++, which will save the file and re-open it with the same letters. Does it somehow secretly add invisible tokens after each character, to know when to interpret as Russian, and when as English?


My current understanding is: keep everything as wchar_t (double-byte) and interpret any file as UTF-16 (double-byte) - is that correct?

Also, I hope to keep the code cross-platform.

Sorry for noob

Kari
  • 1,244
  • 1
  • 13
  • 27
  • No worries. But what is the encoding of your _file_? Is it (as is usually the case these days) UTF-8? If so, there are things that can be done and I can advise. And if you _control_ the encoding of this file, well, use UTF-8! – Paul Sanders Jul 16 '18 at 04:14
  • Yes, I've checked with Notepad, the file is UTF-8; I guess my trouble comes from a misunderstanding of how wchar_t and UTF-8 correlate (if they do). Or I am just complicating things and should just stick to char. I might have to work with UTF-16 as well, so the problem is about understanding that too, not necessarily about some specific file – Kari Jul 16 '18 at 05:22
  • 1
    OK, there is a way. I will post some code in an hour or two, stand by, it's actually a lot easier than you think. – Paul Sanders Jul 16 '18 at 06:49

5 Answers

6

Hokay, let's do this. Let's provide a practical solution to the specific problem of reading text from a UTF-8 encoded file and getting it into a wide string without losing any information.

Once we can do that, we should be OK because the utility functions presented here will handle all UTF-8 to wide-string conversion (and vice-versa) in general and that's the key thing you're missing.

So, first, how would you read in your data? Well, that's easy. Because, at one level, UTF-8 strings are just a sequence of chars, you can, for many purposes, simply treat them that way. So you just need to do what you would do for any text file, e.g.:

#include <fstream>
#include <string>

std::ifstream f;
f.open ("myfile.txt", std::ifstream::in);
if (!f.fail ())
{
    std::string utf8;
    f >> utf8;   // note: >> reads one whitespace-delimited token;
                 // use std::getline (f, utf8) to read a whole line
    // ...
}

So far so good. That all looks easy enough.

But now, to make processing the string we just read in easier (because handling multi-byte strings in code is a total pain), we need to convert it to a so-called wide string before we try to do anything with it. There are actually a few flavours of these (because of the uncertainty surrounding just how 'wide' wchar_t actually is on any particular platform), but for now I'll stick with wchar_t to keep things simple, and doing that conversion is actually easier than you might think.

So, without further ado, here are your conversion functions (which is what you bought your ticket for):

#include <string>
#include <codecvt>
#include <locale>

std::string narrow (const std::wstring& wide_string)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.to_bytes (wide_string);
}

std::wstring widen (const std::string& utf8_string)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.from_bytes (utf8_string);
}

My, that was easy, why did those tickets cost so much in the first place?

I imagine that's all I really need to say. I think, from what you say in your question, that you already had a fair idea of what you wanted to be able to do; you just didn't know how to achieve it (and perhaps hadn't quite joined up all the dots yet). But just in case there is any lingering confusion: once you do have a wide string, you can freely use all the methods of std::basic_string on it and everything will 'just work'. And if you need to convert back to a UTF-8 string to (say) write it out to a file, well, that's trivial now.

Test program over at the most excellent Wandbox. I'll touch this post up later, there are still a few things to say. Time for breakfast now :) Please ask any questions in the comments.

Notes (added as an edit):

  • codecvt is deprecated in C++17 (not sure why), but if you limit its use to just those two functions then it's not really anything to worry about. One can always rewrite those if and when something better comes along (hint, hint, dear standards persons).
  • codecvt can, I believe, handle other character encodings, but as far as I'm concerned, who cares?
  • if std::wstring (which is based on wchar_t) doesn't cut it for you on your particular platform, then you can always use std::u16string or std::u32string.
Paul Sanders
  • 24,133
  • 4
  • 26
  • 48
  • 1
    I also feel an important addition: if a file is read chunk-by-chunk, conversion might fail when an "incomplete" multi-byte character doesn't fit at the end of a chunk. Here is the work-around: https://stackoverflow.com/q/52338904/9007125 – Kari Oct 02 '18 at 12:21
3

The most important question is: what encoding is that text file in? It is most likely not a single-byte encoding, but Unicode of some sort (as there is no way to have Russian and Chinese in one file otherwise, AFAIK). So... run file <textfile.txt> or equivalent, or open the file in a hex editor, to determine the encoding (it could be UTF-8, UTF-16, UTF-32, or something else entirely), and act appropriately.

wchar_t is, unfortunately, rather useless for portable coding. Back when Microsoft decided what that datatype should be, all Unicode characters fit into 16 bits, so that is what they went for. When Unicode was extended to 21 bits, Microsoft stuck with the definition they had, and eventually made their API work with UTF-16 encoding (which breaks the "wide" nature of wchar_t). "The Unixes", on the other hand, made wchar_t 32 bits wide and use UTF-32 encoding, so...

Explaining the different encodings goes beyond the scope of a simple Q&A. There is an article by Joel Spolsky ("The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)") that does a reasonably good job of explaining Unicode though. There are other encodings out there, and I did a table that shows the ISO/IEC 8859 encodings and common Microsoft codepages side by side.

C++11 introduced char16_t (for UTF-16 encoded strings) and char32_t (for UTF-32 encoded strings), but several parts of the standard are not quite capable of handling Unicode correctly (toupper / tolower conversions, comparisons that correctly handle normalized / unnormalized strings, ...). If you want the whole smack, the go-to library for handling all things Unicode (including conversion to / from other encodings) in C/C++ is ICU.

DevSolar
  • 67,862
  • 21
  • 134
  • 209
  • We have now established that the file is in UTF-8. Which is why I _asked_ that question instead of trying to guess. And that made _answering_ the question easy. – Paul Sanders Jul 16 '18 at 21:13
  • @PaulSanders: Yes... well... unfortunately there are a *lot* of things that will *not* "just work", not with wstring, not with u16string, and not with u32string, but.... that's definitely too much to get into in a SO answer. I still recommend using the ICU until C++ gets *real* Unicode support... – DevSolar Jul 16 '18 at 22:07
  • _unfortunately there are a lot of things that will not "just work", with wstring ..._ Such as? – Paul Sanders Jul 16 '18 at 22:21
  • 2
    @PaulSanders: `find()`, regular expressions etc. working for normalized vs. unnormalized code point sequences (e.g. `Ü` vs. diacritic + `U`, which *should* compare equal). Alternatively, *normalization* support. `substr()` not hacking grapheme clusters apart and leaving you with incomplete sequences. Non-simple uppercase / lowercase conversions, e.g. for German `ß` -> `SS`, or ligatures. Determining width of a string, the number of graphemes in it, for formatting. Word iteration. There is more but these are the ones I have to tackle at work, so... – DevSolar Jul 17 '18 at 06:42
  • 1
    Oh well, OK. Yes OP, if you need that level of sophistication in your code, take a look at [ICU](http://site.icu-project.org/home), and maybe also [this post](https://stackoverflow.com/questions/50413471/what-exactly-can-wchar-t-represent/50540643#50540643). – Paul Sanders Jul 17 '18 at 06:45
2

Unfortunately, standard C++ does not have any real support for your situation (e.g. Unicode in C++11).

You will need to use a text-handling library that does support it, something like this one

KarlM
  • 1,614
  • 18
  • 28
  • 1
    For what the OP wants to do, that would be overkill. But if he wants to do fancier stuff then I think most people would recommend [ICU](http://site.icu-project.org/home). – Paul Sanders Jul 16 '18 at 21:12
1

And here's a second answer - about Microsoft's (lack of) standards compliance with regard to wchar_t - because, thanks to the standards committee hedging their bets, the situation with this is more confusing than it needs to be.

Just to be clear, wchar_t on Windows is only 16-bits wide and as we all know, there are many more Unicode characters than that these days, so, on the face of it, Windows is non-compliant (albeit, as we again all know, they do what they do for a reason).

So, moving on, I am indebted to Bo Persson for digging up this (emphasis mine):

The Standard says in [basic.fundamental]/5:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales. Type wchar_t shall have the same size, signedness, and alignment requirements as one of the other integral types, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <cstdint>, called the underlying types.

Hmmm. "Among the supported locales." What's that all about?

Well, I for one don't know, and nor, I suspect, does the person who wrote it. It's just been put in there to let Microsoft off the hook, simple as that. It's just double-speak.

As others have commented here (in effect), the standard is a mess. Someone should put something about this in there that other human beings can understand.

Paul Sanders
  • 24,133
  • 4
  • 26
  • 48
0

The C++ standard defines wchar_t as a type which will support any code point. On Linux this is true. MSVC violates the standard and defines it as a 16-bit integer, which is too small.

Therefore the only portable way to handle strings is to convert them from native strings to UTF-8 on input and from UTF-8 back to native strings at the point of output.

You will of course need to use some #ifdef magic to select the correct conversion and I/O calls depending on the OS.

Non-adherence to standards is the reason we can't have nice things.

Richard Hodges
  • 68,278
  • 7
  • 90
  • 142
  • To be fair, Microsoft defined `wchar_t` as 16 bit when that *was* enough for Unicode, and kept it that way so they would not break existing code. A poor choice IMHO, but not a deliberate violation of the standard at the time. – DevSolar Jul 16 '18 at 04:58
  • @DevSolar What you write is true. I did not write that Microsoft had deliberately violated the standard in this case (although they have in others). The lack of easily portable Unicode support in the C++ standard is probably one of its most glaring omissions IMHO. – Richard Hodges Jul 16 '18 at 07:25
  • ICU does a somewhat decent job of it, although its rather un-C++-like API takes some getting used to. Unicode (as in, **all** of Unicode, not just the easy parts) is a very complex beast, and one that AFAIK most other languages also don't get 100% right. I know there are several designs for full Unicode support in the pipe for C++, but... let's say I am happy not having to discuss those in the committee... – DevSolar Jul 16 '18 at 07:45
  • @DevSolar I think the c++ committee would do us all a favour if it allowed "good enough to be commonly useful" to be the bar for acceptance and further allowed standards refinements as new information is discovered. The current situation leaves us having to adopt "good enough" 3rd-party libraries or write "good enough" code. While every other language has "good enough" support out of the box. Because of this last point, "good enough" is the de-facto standard of the industry in any case. – Richard Hodges Jul 16 '18 at 08:03