Despite seeing a lot of questions on the forum about Unicode and string conversion (in C/C++) and Googling for hours on the topic, I still can't find a straight explanation of what seems to me like a very basic process. Here is what I want to do:
I have a string which can potentially contain characters from any language. Let's take Cyrillic for example. So say I have:
std::string str = "сапоги";
I want to loop over each character making up that string and:
- Know/print the character's Unicode value
- Convert that Unicode value to a decimal value
I really Googled that for hours and couldn't find a straight answer. If someone could show me how this could be done, it would be great.
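For what it's worth, here is a minimal sketch of the loop described above. It assumes the std::string holds well-formed UTF-8, which is not guaranteed in general since it depends on how the bytes got into the string. The names decode_utf8 and print_code_points are made up for illustration, and the decoder does not rigorously reject malformed or truncated input.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Decode a UTF-8 byte string into Unicode code points.
// Sketch only: assumes well-formed input.
std::vector<char32_t> decode_utf8(const std::string& s)
{
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t extra; // number of continuation bytes that follow
        if (lead < 0x80)              { cp = lead;        extra = 0; } // 0xxxxxxx
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; } // 110xxxxx
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; } // 1110xxxx
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; } // 11110xxx
        else                          { cp = 0xFFFD;      extra = 0; } // stray byte
        ++i;
        // Each continuation byte (10xxxxxx) contributes 6 more bits.
        for (; extra > 0 && i < s.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// Print each code point in U+XXXX notation and as a decimal number.
void print_code_points(const std::vector<char32_t>& cps)
{
    for (char32_t cp : cps)
        std::printf("U+%04X  %u\n", static_cast<unsigned>(cp),
                    static_cast<unsigned>(cp));
}
```

Calling print_code_points(decode_utf8(str)) on the UTF-8 bytes of сапоги should list six code points, the first being U+0441 (1089 in decimal). Note that the hexadecimal and decimal forms are the same integer printed two ways, so no separate conversion step is needed.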
EDIT
So I managed to get that far:
#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
#include <iomanip>

// utility function for output: prints each byte of s as two hex digits
void hex_print(const std::string& s)
{
    std::cout << std::hex << std::setfill('0');
    for (unsigned char c : s)
        std::cout << std::setw(2) << static_cast<int>(c) << ' ';
    std::cout << std::dec << '\n';
}

int main()
{
    std::wstring test = L"сапоги";
    // codecvt_utf16 writes big-endian UTF-16 bytes by default
    std::wstring_convert<std::codecvt_utf16<wchar_t>> conv1;
    std::string u16bytes = conv1.to_bytes(test);
    hex_print(u16bytes);
    return 0;
}
Result:
04 41 04 30 04 3f 04 3e 04 33 04 38
Which is correct (it maps to the Unicode values). The problem is that I don't know whether I should use UTF-8, UTF-16 or something else (as pointed out by Chris in the comment). Is there a way I can find that out? (Either the encoding the string uses originally, or the encoding that needs to be used?)
EDIT 2
I thought I would address some of the comments with a second edit:
"Convert that Unicode value to a decimal value" Why?
I will explain why, but I also wanted to point out, in a friendly way, that my question was not 'why' but 'how' ;-). You can assume the OP has a reason for asking this question, yet of course I understand that people are curious as to why... so let me explain. I need all this because I ultimately need to read glyphs from a font file (TrueType or OpenType, it doesn't matter). These files have a table called cmap, which is a sort of associative array that maps the value of a character (in the form of a code point) to the index of the glyph in the font file. The code points in the table are not written in the U+XXXX notation but stored directly as the number itself, i.e. its decimal counterpart (assuming the U+XXXX notation is the hexadecimal representation of a uint16 number [or U+XXXXXX if greater than uint16, but more on that later]). So in summary, the letter г in Cyrillic ([gueu]) has code point U+0433, which in decimal form is 1075. I need the value 1075 to do a lookup in the cmap table.
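To make the lookup idea concrete, here is a small sketch. It does not parse a real font file: the CMap type, make_example_cmap and glyph_index are hypothetical stand-ins, and the glyph indices are invented for illustration. The point it tries to show is that 0x0433 and 1075 are the same integer, so the lookup key is simply the code point.

```cpp
#include <cstdint>
#include <map>

// Hypothetical stand-in for a parsed cmap table: code point -> glyph index.
using CMap = std::map<std::uint32_t, std::uint16_t>;

// The glyph indices below are made up; a real table comes from the font file.
CMap make_example_cmap()
{
    return { {0x0433, 7}, {0x0441, 12} };
}

// Returns the glyph index for a code point, or 0 when the code point is
// absent (glyph 0 is the "missing glyph" in TrueType/OpenType fonts).
std::uint16_t glyph_index(const CMap& cmap, std::uint32_t code_point)
{
    auto it = cmap.find(code_point);
    return it != cmap.end() ? it->second : 0;
}

// Hexadecimal and decimal literals denote the same value.
static_assert(0x0433 == 1075, "U+0433 is 1075 in decimal");
```

With this stand-in, glyph_index(cmap, 1075) and glyph_index(cmap, 0x0433) return the same entry.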
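// utility function for output: prints each byte in hex and, after every
// second byte, the big-endian 16-bit value as U+XXXX and in decimal
// (uint16_t needs <cstdint>)
void hex_print(const std::string& s)
{
    std::cout << std::hex << std::setfill('0');
    uint16_t i = 0, dec = 0;
    for (unsigned char c : s) {
        std::cout << std::setw(2) << static_cast<int>(c) << ' ';
        // even index: high byte; odd index: low byte completes the value
        dec = (i++ % 2 == 0) ? (c << 8) : (dec | c);
        if (i % 2 == 0)
            printf("Unicode Value: U+%04x Decimal value of code point: %d\n", dec, dec);
    }
    std::cout << std::dec;
}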
std::string is encoding-agnostic. It essentially stores bytes. std::wstring is weird, though also not defined to hold any specific encoding. In Windows, wchar_t is used for UTF-16
Yes, exactly. For a while I thought (at least I did) that strings were just storing "ASCII" characters (hold on here); once you understand that they are not, this assumption appears to be really wrong. In fact std::string, as suggested by the comment, only seems to store 'bytes'. Clearly, if you look at the bytes of the string english you get:
std::string eng = "english";
hex_print(eng);
65 6e 67 6c 69 73 68
and if you do the same thing with "сапоги" you get:
std::string cyrillic = "сапоги";
hex_print(cyrillic);
d1 81 d0 b0 d0 bf d0 be d0 b3 d0 b8
What I'd really like to know/understand is how this conversion is implicitly done. Why is UTF-8 encoding used here rather than UTF-16, and is there a possibility of changing that (or is that defined by my IDE or OS)? Clearly when I copy and paste the string сапоги into my text editor, it actually copies an array of 12 bytes already (these 12 bytes could be UTF-8 or UTF-16).
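One way to take the editor and compiler settings out of the picture, as far as I can tell, is to request the encoding explicitly with the literal prefixes u8, u and U, and to spell the characters as universal character names (\uXXXX) so the source file's own encoding no longer matters. The sketch below assumes a C++11 or later compiler; the names utf8_bytes, as_utf16 and as_utf32 are made up for illustration.

```cpp
#include <cstddef>
#include <string>

// UTF-8: sizeof counts code units including the terminating null, hence "- 1".
// (The element type is char before C++20 and char8_t from C++20 on,
// both one byte wide.)
constexpr std::size_t utf8_bytes =
    sizeof(u8"\u0441\u0430\u043f\u043e\u0433\u0438") - 1;

// UTF-16: one 16-bit code unit per character for this particular string.
inline std::u16string as_utf16()
{
    return u"\u0441\u0430\u043f\u043e\u0433\u0438";
}

// UTF-32: one 32-bit code unit per code point, always.
inline std::u32string as_utf32()
{
    return U"\u0441\u0430\u043f\u043e\u0433\u0438";
}
```

Here the UTF-8 form takes 12 bytes while the UTF-16 and UTF-32 forms each hold 6 code units. For a plain "сапоги" literal without a prefix, the bytes you get depend on the encoding the file was saved in and on the compiler's execution character set, which would explain why the 12 bytes above came out as UTF-8.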
I think there is a confusion between Unicode and encoding. Codepoint (AFAIK) is just a character code. UTF 16 gives you the code, so you can say your 0x0441 is a с codepoint in case of Cyrillic small letter es. To my understanding UTF16 maps one-to-one with Unicode codepoint which have a range of 1M and something characters. However, other encoding techniques, for example UTF-8 does not maps directly to Unicode codepoint. So, I guess, you better stick to the UTF-16
Exactly! I found this comment very useful indeed. Because yes, there is confusion (and I was confused) about the fact that the way you encode a Unicode code point has nothing to do with the Unicode value itself. Well, sort of, because things can be misleading, as I will show now. You can indeed encode the string сапоги using UTF-8 and you will get:
d1 81 d0 b0 d0 bf d0 be d0 b3 d0 b8
So clearly these bytes do not directly show the Unicode values of the characters. Now if you encode the same string using UTF-16 you get:
04 41 04 30 04 3f 04 3e 04 33 04 38
Where 04 and 41 are indeed the two bytes (in hexadecimal form) of the letter с ([se] in Cyrillic). In this case at least, there is a direct mapping between the Unicode value and its uint16 representation. And this is why (per Wiki's explanation [source]):
Both UTF-16 and UCS-2 encode code points in this range as single 16-bit code units that are numerically equal to the corresponding code points.
But as someone suggested in the comment, some code points values go beyond what you can define with 2 bytes. For example:
1D307 TETRAGRAM FOR FULL CIRCLE (Tai Xuan Jing Symbols)
which is what this comment was suggesting:
To my knowledge, UTF-16 doesn't cover all characters unless you use surrogate pairs. It was meant to originally, when 65k was more than enough, but that went out the window, making it an extremely awkward choice now
Though to be perfectly exact, UTF-16, like UTF-8, CAN encode ALL characters; it may simply use up to 4 bytes to do so (as you suggested, it uses surrogate pairs when more than 2 bytes are needed).
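As a sketch of how that works for a code point such as U+1D307, the function below splits a value above U+FFFF into a high and a low surrogate. The name encode_utf16 is made up for illustration, and the code assumes its argument is a valid code point (at most U+10FFFF and not itself in the surrogate range D800 to DFFF).

```cpp
#include <vector>

// Encode one code point as one or two UTF-16 code units.
// Assumes cp is a valid Unicode scalar value.
std::vector<char16_t> encode_utf16(char32_t cp)
{
    if (cp < 0x10000)
        return { static_cast<char16_t>(cp) }; // fits in a single code unit

    cp -= 0x10000; // 20 bits remain, split into two halves of 10 bits
    return {
        static_cast<char16_t>(0xD800 + (cp >> 10)),   // high surrogate
        static_cast<char16_t>(0xDC00 + (cp & 0x3FF))  // low surrogate
    };
}
```

For U+0441 this yields the single unit 0441, while U+1D307 becomes the pair D834 DF07, i.e. 4 bytes.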
I tried to do a conversion to UTF-32 using mbrtoc32, but cuchar is strangely missing on Mac.
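A possible alternative that does not need cuchar is the same <codecvt> machinery used earlier, this time with std::codecvt_utf8<char32_t>. This is only a sketch: the function name utf8_to_utf32 is made up, the input is assumed to be valid UTF-8 (from_bytes throws std::range_error otherwise), and std::wstring_convert is deprecated since C++17, so a newer compiler may warn about it.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Convert UTF-8 bytes to UTF-32, where each element is one code point.
// Throws std::range_error if the input is not valid UTF-8.
std::u32string utf8_to_utf32(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);
}
```

Each element of the returned std::u32string can then be used directly as a code point value, for example 0x0433 (1075) for г.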
BTW, if you don't know what a surrogate pair is (I didn't), there's a nice post about this on the forum.