String to Unicode, and Unicode to decimal code point (C++)

Question

Despite seeing a lot of questions on the forum about Unicode and string conversion (in C/C++) and Googling for hours on the topic, I still can't find a straight explanation of what seems to me like a very basic process. Here is what I want to do:

I have a string which potentially uses any characters of any possible language. Let's take cyrillic for example. So say I have: std::string str = "сапоги";
I want to loop over each character making up that string and:
- Know/print the character's Unicode value
- Convert that Unicode value to a decimal value

I really Googled that for hours and couldn't find a straight answer. If someone could show me how this could be done, it would be great.

EDIT

So I managed to get that far:

#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <locale>
#include <codecvt>
#include <iomanip>

// utility function for output
void hex_print(const std::string& s)
{
    std::cout << std::hex << std::setfill('0');
    for(unsigned char c : s)
        std::cout << std::setw(2) << static_cast<int>(c) << ' ';
    std::cout << std::dec << '\n';
}

int main()
{
    std::wstring test = L"сапоги";

    // codecvt_utf16 serializes to UTF-16 bytes (big-endian by default), not UTF-8
    std::wstring_convert<std::codecvt_utf16<wchar_t>> conv1;
    std::string u16bytes = conv1.to_bytes(test);
    hex_print(u16bytes);

    return 0;
}

Result:

04 41 04 30 04 3f 04 3e 04 33 04 38

Which is correct (it maps to unicode). The problem is that I don't know whether I should use utf-8, 16 or something else (as pointed out by Chris in the comment). Is there a way I can find out about that? (whatever encoding it uses originally or whatever encoding needs to be used?)

EDIT 2

I thought I would address some of the comments with a second edit:

"Convert that Unicode value to a decimal value" Why?

I will explain why, but I also wanted to point out, in a friendly way, that my question was not 'why' but 'how' ;-). You can assume the OP has a reason for asking this question, yet of course I understand people are curious as to why, so let me explain. I need all this because I ultimately need to read the glyphs from a font file (TrueType or OpenType, it doesn't matter). These files have a table called cmap, a sort of associative array that maps the value of a character (in the form of a code point) to the index of the glyph in the font file. The code points in the table are not written using the U+XXXX notation but directly as the decimal counterpart of that number (assuming the U+XXXX notation is the hexadecimal representation of a uint16 number [or U+XXXXXX if greater than uint16, but more on that later]). So in summary, the letter г in Cyrillic ([gueu]) has the code point value U+0433, which in decimal form is 1075. I need the value 1075 to do a lookup in the cmap table.
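As a minimal sketch of the U+0433 / 1075 relationship described above: both are the same integer, only printed in a different base. The helper names `to_u_notation` and `to_decimal` are made up for illustration and are not part of any library.

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats a code point in the usual "U+XXXX" hex notation.
std::string to_u_notation(std::uint32_t code_point)
{
    std::ostringstream out;
    out << "U+" << std::uppercase << std::hex
        << std::setfill('0') << std::setw(4) << code_point;
    return out.str();
}

// Hypothetical helper: the very same integer, written in base 10.
std::string to_decimal(std::uint32_t code_point)
{
    return std::to_string(code_point);
}
```

For example, `to_u_notation(0x0433)` gives "U+0433" and `to_decimal(0x0433)` gives "1075", so a cmap lookup only needs the integer itself, not a separate "decimal conversion".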

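// utility function for output: prints each byte, then each UTF-16BE code unit
// (only valid for code points below U+10000, i.e. no surrogate pairs)
void hex_print(const std::string& s)
{
    std::cout << std::hex << std::setfill('0');
    uint16_t i = 0, dec = 0;
    for(unsigned char c : s) {
        std::cout << std::setw(2) << static_cast<int>(c) << ' ';
        // even index: high byte; odd index: low byte completes the 16-bit unit
        dec = (i++ % 2 == 0) ? (c << 8) : (dec | c);
        if (i % 2 == 0) // a full byte pair has been read
            printf("Unicode Value: U+%04x Decimal value of code point: %u\n",
                   static_cast<unsigned>(dec), static_cast<unsigned>(dec));
    }
    std::cout << std::dec;
}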

std::string is encoding-agnostic. It essentially stores bytes. std::wstring is weird, though also not defined to hold any specific encoding. In Windows, wchar_t is used for UTF-16

Yes, exactly. For a while I thought (hold on here) that strings just stored "ASCII" characters, and that turns out to be really wrong. In fact std::string, as the comment suggests, only seems to store 'bytes'. Clearly, if you look at the bytes of the string english you get:

std::string eng = "english";
hex_print(eng);
65 6e 67 6c 69 73 68

and if you do the same thing with "сапоги" you get:

std::string cyrillic = "сапоги";
hex_print(cyrillic);
d1 81 d0 b0 d0 bf d0 be d0 b3 d0 b8

What I'd really like to know/understand is how this conversion is implicitly done. Why is UTF-8 used here rather than UTF-16, and is there a possibility of changing that (or is that defined by my IDE or OS)? Clearly, when I copy-paste the string сапоги into my text editor, it already copies an array of 12 bytes (these 12 bytes could be UTF-8 or UTF-16).

I think there is a confusion between Unicode and encoding. Codepoint (AFAIK) is just a character code. UTF 16 gives you the code, so you can say your 0x0441 is a с codepoint in case of Cyrillic small letter es. To my understanding UTF16 maps one-to-one with Unicode codepoint which have a range of 1M and something characters. However, other encoding techniques, for example UTF-8 does not maps directly to Unicode codepoint. So, I guess, you better stick to the UTF-16

Exactly! I found this comment very useful indeed. Because yes, there is confusion (and I was confused) about the fact that the way you encode a Unicode code point has nothing to do with the code point value itself. Well, sort of, because things can be misleading, as I will show now. You can indeed encode the string сапоги using UTF-8 and you will get:

d1 81 d0 b0 d0 bf d0 be d0 b3 d0 b8

So clearly it has nothing to do with the Unicode values of the glyphs indeed. Now if you encode the same string using UTF16 you get:

04 41 04 30 04 3f 04 3e 04 33 04 38

Where 04 and 41 are indeed the two bytes (in Hexadecimal form) of the letter с ([se] in cyrillic). In this case at least, there is a direct mapping between the unicode value and its uint16 representation. And this is why (per Wiki's explanation [source]):

Both UTF-16 and UCS-2 encode code points in this range as single 16-bit code units that are numerically equal to the corresponding code points.

But as someone suggested in the comment, some code points values go beyond what you can define with 2 bytes. For example:

1D307 TETRAGRAM FOR FULL CIRCLE (Tai Xuan Jing Symbols)

which is what this comment was suggesting:

To my knowledge, UTF-16 doesn't cover all characters unless you use surrogate pairs. It was meant to originally, when 65k was more than enough, but that went out the window, making it an extremely awkward choice now

To be perfectly exact, though, UTF-16 like UTF-8 CAN encode ALL characters, but it can use up to 4 bytes to do so (as you suggested, it uses surrogate pairs if more than 2 bytes are needed).
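To make the surrogate-pair point concrete, here is a minimal sketch of how one code point maps to one or two UTF-16 code units. The function names are made up for illustration, and the code assumes a valid input (at most U+10FFFF and not itself in the surrogate range D800 to DFFF); it does no error checking.

```cpp
#include <cstdint>
#include <vector>

// Encodes one code point as UTF-16 code units (1 unit, or a surrogate pair).
// Assumption: cp is a valid Unicode scalar value; nothing is validated here.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp)
{
    if (cp < 0x10000)
        return { static_cast<std::uint16_t>(cp) };  // numerically equal to the code point

    cp -= 0x10000;  // leaves a 20-bit value, split into two 10-bit halves
    return {
        static_cast<std::uint16_t>(0xD800 + (cp >> 10)),    // high (lead) surrogate
        static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF))   // low (trail) surrogate
    };
}

// Recombines a surrogate pair into the original code point.
std::uint32_t decode_surrogate_pair(std::uint16_t high, std::uint16_t low)
{
    return 0x10000u
         + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00u);
}
```

With this, U+0441 stays the single unit 0x0441, while U+1D307 becomes the pair 0xD834 0xDF07, which is why reading UTF-16 two bytes at a time only yields code points directly for values below U+10000.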

I tried to do a conversion to UTF-32 using mbrtoc32 but cuchar is strangely missing on Mac.

BTW, if you don't know what a surrogate pair is (I didn't) there's a nice post about this on the forum.

Did you want to use something like `std::string str = L"сапоги"`? — πάντα ῥεῖ, Mar 05 '17 at 20:07
I don't know. My goal is to find the unicode value of each character making up the string and convert that to decimal value. — user18490, Mar 05 '17 at 20:07
This is a good read: http://reedbeta.com/blog/programmers-intro-to-unicode/ — tnt, Mar 05 '17 at 20:08
You'll need to know the encoding of the string (e.g., UTF-8) and preferably find a library that allows you to iterate over the code points then. — chris, Mar 05 '17 at 20:10
@πάντα ῥεῖ the description of my question is clear. Loop over characters of a string using cyrillic characters and print out the Unicode value of each one of these characters? What else do you need? — user18490, Mar 05 '17 at 20:20
What is the encoding of the string: code page (mbcs), one of the unicode encodings, some other encoding. Just telling us the language of the string give no clue to its encoding. If MBCS you need to know the code page. Some of the MBCS code pages: https://en.wikipedia.org/wiki/Variable-width_encoding#MBCS — Richard Critten, Mar 05 '17 at 20:28
@user18490 _"the description of my question is clear."_ No it isn't. There's a load of missing information. I'd recommend you post a [MCVE] that reproduces your problem. — πάντα ῥεῖ, Mar 05 '17 at 20:33
If you don't know the encoding of the original string, you can try to guess it, but such guessing can never be perfect. BOMs can help, but they don't have to be there. — chris, Mar 05 '17 at 20:38
Unicode is a mess because too many encoding have been introduced. But it's not a very big mess. You just need to read up on unicode code points, UTF-32, UTF-16 and UTF-8. UTF-8 is the encoding most used in practice because it is backwards-compatible with Ascii. — Malcolm McLean, Mar 05 '17 at 20:39
@MalcolmMcLean: I read about Unicode for a couple of hours)) maybe not enough. I understand that you can encode actually any character at all using whatever convention you want (utf8, 16 or 32). But I am confused about the C++ part. How does this information is internally stored in `wstring` or `string`. — user18490, Mar 05 '17 at 20:43
`std::string` is encoding-agnostic. It essentially stores bytes. `std::wstring` is weird, though also not defined to hold any specific encoding. In Windows, `wchar_t` is used for UTF-16. — chris, Mar 05 '17 at 20:56
I think there is a confusion between Unicode and encoding. Codepoint (AFAIK) is just a character code. UTF 16 gives you the code, so you can say your 0x0441 is a с codepoint in case of Cyrillic small letter es. To my understanding UTF16 maps one-to-one with Unicode codepoint which have a range of 1M and something characters. However, other encoding techniques, for example UTF-8 does not maps directly to Unicode codepoint. So, I guess, you better stick to the UTF-16 — kreuzerkrieg, Mar 05 '17 at 21:02
@kreuzerkrieg, To my knowledge, UTF-16 doesn't cover all characters unless you use surrogate pairs. It was meant to originally, when 65k was more than enough, but that went out the window, making it an extremely awkward choice now. — chris, Mar 05 '17 at 21:17
@chris, yup... then it encoded as two 16 bit code units. there is nothing sixteen in UTF-sixteen :) up to 0x110000 — kreuzerkrieg, Mar 05 '17 at 21:21
"Convert that Unicode value to a decimal value" Why? The convention to identify Unicode codepoints is like U+FFFF, U+FFFFF or [U+10FFFF](https://www.google.com/?q=U%2B10FFFF)—except maybe in HTML and XML &#1114111; but even there &#x10FFFF; is easier to read. — Tom Blodget, Mar 06 '17 at 01:47
There is no such thing as "decimal value". Integers are integers, their values do not depend on the number of digits on your hands. Decimal *representation* is what normally appears on your screen when you *print* an integer. — n. m. could be an AI, Mar 06 '17 at 04:25
Also don't use UTF-16, unless you must interact with APIs that work in UTF-16 (such as Windows). UTF-8 is the preferred encoding for communication. Internally, use UTF-8 (u8"" strings, char8_t) unless you need to work with individual Unicode codepoints, in which case use UTF-32 (U"" strings, char32_t). This actually should be very rare, most applications never need to touch codepoints. — n. m. could be an AI, Mar 06 '17 at 04:32

Davislor · Accepted Answer · 2017-03-06T11:19:06.717

For your purposes, finding and printing the value of each character, you probably want to use char32_t, because that has no multi-byte strings or surrogate pairs and can be converted to decimal values just by casting to unsigned long. I would link to an example I wrote, but it sounds as if you want to solve this problem yourself.
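For readers who only want to see the idea, a minimal sketch of the char32_t approach could look like this. The function name `code_point_values` is hypothetical; the only assumption is that the text is already available as UTF-32 (for instance from a U"" literal).

```cpp
#include <string>
#include <vector>

// Returns the numeric value of every code point in a UTF-32 string.
// In UTF-32 each element is exactly one code point, so a cast is all it takes.
std::vector<unsigned long> code_point_values(const std::u32string& text)
{
    std::vector<unsigned long> values;
    values.reserve(text.size());
    for (char32_t c : text)
        values.push_back(static_cast<unsigned long>(c));
    return values;
}
```

Calling it on `U"\u0441\u0430\u043F\u043E\u0433\u0438"` (the word сапоги written with escapes, so the result does not depend on the source file's encoding) yields 1089, 1072, 1087, 1086, 1075, 1080. Printing these with the default stream settings shows them in decimal.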

C++11 directly supports the types char16_t and char32_t (C++20 adds char8_t), in addition to the legacy wchar_t that sometimes means UCS-4, sometimes UTF-16LE, sometimes UTF-16BE, sometimes something different. It also lets you store strings in any of these formats, no matter what character set you saved your source file in, with the u8"", u"" and U"" prefixes, and the \uXXXX Unicode escape as a fallback. For backward compatibility, you can encode UTF-8 with hex escape codes in an array of unsigned char.
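A small sketch of what the three prefixes do to the same text: the helper below (the name `code_units` is made up) just counts the code units in a string literal, excluding the terminating null. It should compile unchanged from C++11 through C++20, since it never names the element type of the u8 literal.

```cpp
#include <cstddef>

// Number of code units in a string literal, not counting the terminator.
// Works for char, char8_t, char16_t and char32_t arrays alike.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N])
{
    return N - 1;
}

// The same six Cyrillic letters, written with \u escapes so the result
// does not depend on how the source file itself is encoded.
constexpr std::size_t utf8_units  = code_units(u8"\u0441\u0430\u043F\u043E\u0433\u0438"); // 12
constexpr std::size_t utf16_units = code_units(u"\u0441\u0430\u043F\u043E\u0433\u0438");  // 6
constexpr std::size_t utf32_units = code_units(U"\u0441\u0430\u043F\u043E\u0433\u0438");  // 6
```

For a character outside the 16-bit range such as `\U0001D307`, the counts become 4 (UTF-8), 2 (UTF-16, a surrogate pair) and 1 (UTF-32), which is the reason char32_t is the convenient choice when individual code points are needed.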

Therefore, you can store the data in any format you want. You could also use the facet codecvt<wchar_t,char,mbstate_t>, which all locales are required to support. There are also the multi-byte string functions in <wchar.h> and <uchar.h>.

I highly recommend you store all new external data in UTF-8. This includes your source files! (Annoyingly, some older software still doesn’t support it.) It may also be convenient to use the same character set internally as your libraries, which will be UTF-16 (wchar_t) on Windows. If you need fixed-length characters that can hold any codepoint with no special cases, char32_t will be handy.

score 0 · Answer 2 · answered Mar 05 '17 at 21:15

Originally computers were designed for the American market and used ASCII, the American Standard Code for Information Interchange. This had 7-bit codes, with just the basic English letters and a few punctuation marks, plus codes at the lower end designed to drive paper-and-ink printer terminals. This became inadequate as computers developed and started to be used for language processing as much as for numerical work. The first thing that happened was that various expansions to 8 bits were proposed. These could either cover most of the decorated European characters (accents, etc.) or give a series of basic graphics good for creating menus and panels, but you couldn't achieve both. There was still no way of representing non-Latin character sets like Greek. So a 16-bit code was proposed, and called Unicode. Microsoft adopted this very early and invented the wchar / WCHAR type (it has various identifiers) to hold international characters. However, it emerged that 16 bits wasn't enough to hold all glyphs in common use, and the Unicode Consortium also introduced some minor incompatibilities with Microsoft's 16-bit code set.

So Unicode can be a series of 16-bit integers. That's a wchar string. ASCII text stored this way has a zero in the high byte of every character, so you can't pass a wide string to a function expecting ASCII. Since 16 bits was nearly but not quite enough, a 32-bit Unicode set was also produced.

However, when you saved Unicode to a file, this created problems: was it 16-bit or 32-bit? And was it big-endian or little-endian? So a flag at the start of the data was proposed to remedy this. The problem was that the file contents, memory-wise, no longer matched the string contents.
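The "flag at the start of the data" is the byte order mark (BOM). As a rough sketch, it can be checked like this; the function names are made up, the BOM byte values are the standard ones, and a missing BOM proves nothing about the encoding. Note one built-in ambiguity: a UTF-16LE file whose first character after the BOM is U+0000 looks the same as a UTF-32LE BOM.

```cpp
#include <cstddef>
#include <initializer_list>
#include <string>

// Returns true if `bytes` begins with the given byte sequence.
bool has_prefix(const std::string& bytes, std::initializer_list<unsigned char> prefix)
{
    if (bytes.size() < prefix.size()) return false;
    std::size_t i = 0;
    for (unsigned char expected : prefix)
        if (static_cast<unsigned char>(bytes[i++]) != expected) return false;
    return true;
}

// Hypothetical helper: names the encoding suggested by a leading BOM, or "none".
// UTF-32 is tested before UTF-16 because FF FE 00 00 also starts with FF FE.
std::string detect_bom(const std::string& bytes)
{
    if (has_prefix(bytes, {0xEF, 0xBB, 0xBF}))       return "UTF-8";
    if (has_prefix(bytes, {0xFF, 0xFE, 0x00, 0x00})) return "UTF-32LE";
    if (has_prefix(bytes, {0x00, 0x00, 0xFE, 0xFF})) return "UTF-32BE";
    if (has_prefix(bytes, {0xFF, 0xFE}))             return "UTF-16LE";
    if (has_prefix(bytes, {0xFE, 0xFF}))             return "UTF-16BE";
    return "none";
}
```

The BOM bytes would then have to be skipped before decoding, which is the mismatch between file contents and string contents mentioned above.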

C++'s std::string was templated (as std::basic_string) so it could use basic chars or one of the wide types, almost always in practice Microsoft's 16-bit near-Unicode encoding.

Then UTF-8 was invented to come to the rescue. This is a multi-byte, variable-length encoding, which uses the fact that ASCII is only 7 bits. So if the high bit is set, it means that the character takes two, three, or four bytes. Now a very large number of strings are English-language text or mainly human-readable numbers, so essentially ASCII. These strings are the same in ASCII as in UTF-8, which makes life a whole lot easier. You have no byte-order convention problems. You do have the problem that you must decode the UTF-8 to code points with a not entirely trivial function, and remember to advance your read position by the correct number of bytes.
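The "not entirely trivial function" can be sketched roughly as follows. This is a simplified decoder under stated assumptions, not a production one: the name `decode_utf8` is made up, it throws on truncated or malformed sequences, but it does not reject overlong encodings or encoded surrogates as a strict decoder should.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Decodes a UTF-8 byte string into a list of code points (simplified sketch).
std::vector<std::uint32_t> decode_utf8(const std::string& bytes)
{
    std::vector<std::uint32_t> code_points;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t length = 0;
        std::uint32_t cp = 0;

        // The lead byte's high bits say how many bytes the character uses.
        if (lead < 0x80)                { length = 1; cp = lead; }         // 0xxxxxxx (ASCII)
        else if ((lead & 0xE0) == 0xC0) { length = 2; cp = lead & 0x1F; }  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) { length = 3; cp = lead & 0x0F; }  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) { length = 4; cp = lead & 0x07; }  // 11110xxx
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + length > bytes.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k < length; ++k) {
            const unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3Fu);
        }

        code_points.push_back(cp);
        i += length;  // advance by the whole sequence, not by one byte
    }
    return code_points;
}
```

Fed the twelve bytes d1 81 d0 b0 d0 bf d0 be d0 b3 d0 b8 from the question, it returns six code points starting with 0x0441 (1089), and plain ASCII input comes back unchanged, one value per byte.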

UTF-8 is really the answer, but the other encodings are still in use and you will come across them.
