I've spent many hours now reading about Unicode, its encodings and many related topics.
The reason for my research is that I am trying to read the contents of a file and parse them character by character.
Correct me if I am wrong please:
- C++'s getc() returns an int which might equal EOF. If the return value does not equal EOF, it can be safely assigned to a char. Since std::string is based on char, we can build std::strings with these chars and use those.
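To make that understanding concrete, here is a minimal sketch of the loop I have in mind. The helper name read_all is hypothetical (it is not a standard function), and the sketch assumes the stream has already been opened, ideally in binary mode so that no newline translation happens:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: reads every remaining byte of an open stream
// into a std::string, one getc() call at a time.
std::string read_all(std::FILE* file) {
    assert(file != nullptr);  // precondition: the caller opened the stream
    std::string text;
    int c;  // int rather than char, so EOF stays distinguishable from every byte value
    while ((c = std::getc(file)) != EOF) {
        text.push_back(static_cast<char>(c));  // not EOF, so it fits in a char
    }
    return text;
}
```

Note that each element of the resulting std::string is a single byte of the file, not necessarily a whole character.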
I have a C# background where we use C#'s char (16-bit) for strings. The values of these chars map directly to Unicode values. A char whose value is 5 is equal to the Unicode character located at U+0005.
What I don't understand is how to read a file in C++ that contains characters whose values might be larger than a byte. I don't feel comfortable using getc() when I can only read characters whose values are limited to a byte.
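To illustrate the concern, here is a rough sketch of what turning bytes into Unicode values might involve. It assumes the file is UTF-8 encoded, which is my assumption and not something the file format guarantees. The helper name decode_utf8 is hypothetical, and the sketch is deliberately incomplete: it substitutes U+FFFD for malformed or truncated sequences and does not reject overlong encodings or surrogate values.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into Unicode code points.
// Assumes mostly well-formed input; malformed or truncated sequences
// become U+FFFD. Overlong encodings and surrogates are NOT rejected.
std::u32string decode_utf8(const std::string& bytes) {
    const char32_t replacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)              { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(replacement); ++i; continue; }  // stray byte

        if (i + extra >= bytes.size()) {  // sequence cut off by end of input
            out.push_back(replacement);
            break;
        }
        assert(i + extra < bytes.size());  // all continuation bytes are in range

        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) { ok = false; break; }  // not 10xxxxxx
            cp = (cp << 6) | (b & 0x3F);  // append the 6 payload bits
        }
        if (ok) { out.push_back(cp); i += extra + 1; }
        else    { out.push_back(replacement); ++i; }
    }
    return out;
}
```

With this sketch, the three bytes 0xE2 0x82 0xAC read one at a time through getc() would combine into the single value U+20AC, which is exactly the kind of character a single char cannot hold.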
I might be missing an important point on how to correctly read files with C++.
Any insights are very much appreciated.
I am running Windows 10 x64 and using VC++, but I'd prefer to keep this question platform-independent if possible.
EDIT
I'd like to highlight a Stack Overflow post linked in the comments by Klitos Kyriacou:
How well is Unicode supported in C++11?
It's a quick dive into how badly Unicode is supported in C++.
For more details you should read/watch the resources provided in the accepted answer.