How to deal with Unicode characters in C++

Question

We have a commenting system built into our engine that allows programmers to put comments for various exposed variables/objects which are then used by the GUI front-end for tool-tips and help.

Recently, certain tool-tips started crashing, and after much wasted time I tracked it down to the character ’, which, unless I am mistaken, is a Unicode character and not available in ASCII.

Taking this answer into consideration, I assumed wstring would fix the problem. Before making changes in the bigger project, I created a test project to see if wstring would solve the issue. Although the project doesn't crash, the behavior is not as expected for wstring.

#include <iostream>
#include <string>

using namespace std;

int main()
{
    string someString = "successive attack that DOESN’T result";
    wstring someWString = L"successive attack that DOESN’T result";

    cout << someString << endl;
    wcout << someWString << endl;

    return 0;
}

//Console Output//
successive attack that DOESNÆT result 
successive attack that DOESNPress any key to continue . . .

I read this article quite some time ago and thought I understood the problems associated with character sets, but that is obviously not the case. I would appreciate a solution to this problem as well as a good explanation of what is happening and how to avoid problems similar to this in the future.

Maybe the source file itself isn't encoded properly. What's its encoding? — Niklas B., Feb 10 '12 at 16:08
IIRC the console doesn't support non-code-page characters well. Do your tools tips work? — Rup, Feb 10 '12 at 16:10
@NiklasB.: I am not sure how I would check that? I am using Visual Studio 2008 to create a new project and the source file in the above example. I am not sure how I would check the encoding for the source file itself...? In project properties I have tried both `Use Multi-byte Character Set` and `Use Unicode Character Set` with no difference in the output. — Samaursa, Feb 10 '12 at 16:13
@Rup: I have to modify quite a bit of code to make it work with `wstring` so I thought I'd try on a smaller project before making the changes and finding out they did not fix the problem. — Samaursa, Feb 10 '12 at 16:14

score 4 · Accepted Answer · answered Feb 10 '12 at 16:22

Since you are using Visual Studio, I assume you are on Windows. By default, the Windows console does not handle Unicode well: narrow output is interpreted in the console's OEM code page. You can convert between the two using CharToOemW/OemToCharW, but the OEM code page obviously cannot represent every Unicode character.
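One way to see where the Æ in the question's output comes from is to look at the raw bytes of the narrow string. The following is a minimal sketch under two assumptions that the question does not confirm: that the source file was saved in Windows-1252, where ’ is the single byte 0x92, and that the console uses OEM code page 437, which maps 0x92 to Æ. The helper name `hex_bytes` is made up for this illustration.

```cpp
#include <string>

// Hypothetical helper (not part of the original code): renders every byte
// of a narrow string as two upper-case hex digits separated by spaces.
std::string hex_bytes(const std::string& s)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        // Cast first: plain char may be signed, so 0x92 could be negative.
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (i != 0) {
            out += ' ';
        }
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}

// Assumed Windows-1252 bytes for "DOESN’T" (the quote written as \x92):
// hex_bytes("DOESN\x92T") returns "44 4F 45 53 4E 92 54"
```

If the dump of the real string shows 92 at the position of the apostrophe, the text itself is intact and only the console's interpretation of that byte differs.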

Windows uses UTF-16 for its system API. If your tooltips use the Windows API, wstring is probably what you want. However, you can also keep your strings in UTF-8 and convert them to UTF-16 just before calling the Windows API. The conversion can be done with MultiByteToWideChar/WideCharToMultiByte.
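To illustrate what the UTF-8 to UTF-16 step involves, here is a hand-rolled, portable sketch. It is not the Windows API: on Windows you would normally call MultiByteToWideChar with CP_UTF8 and receive wchar_t data. The function name `utf8_to_utf16` and the use of std::u16string (which needs C++11, so a newer compiler than Visual Studio 2008) are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 and re-encodes it as UTF-16.
// Sketch only: it rejects malformed sequences but not overlong encodings.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;   // code point being assembled
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            throw std::invalid_argument("invalid code point");
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For example, the UTF-8 bytes E2 80 99 (the ’ from the question) come out as the single UTF-16 unit 0x2019. Converting only at the API boundary lets the rest of the code keep using regular std::string.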

rasmus


Is there a temporary fix I can do just to get a fixed build out (e.g. ignore the unicode character as soon as it is encountered)? I will then start converting all strings to `wstring` (which is going to take quite some time). – Samaursa Feb 10 '12 at 16:27
If you skip all characters with value > 127, you will only get ASCII characters. – rasmus Feb 10 '12 at 16:29
What favors UTF-8 is that you can continue to use regular strings, i.e., you do not need to convert all your strings to wstring. Instead, you only need to convert when calling the Unicode (UTF-16) Windows API. – rasmus Feb 10 '12 at 16:45
I am still confused about something. The problem character can be represented in a `char` variable. It will not show up as `’` but will show up as `Æ` ... why does that result in a crash? Any guess as to what could possibly be going wrong in the code when encountering this character? – Samaursa Feb 10 '12 at 17:12
Without knowing how your code processes these strings, it is hard to say. Perhaps you can provide more information? In general, this character is probably outside the character set that your code supports, and the code does not handle that case gracefully. Æ is only the interpretation of the char in the OEM char set, which is probably not what your code uses. – rasmus Feb 10 '12 at 17:19
I agree, it is hard to say without the code (the codebase is huge). I am not sure which information to provide as I cannot track down where the code fails to deal with the character. Thanks for the help/explanation as it helped me better understand the problem. – Samaursa Feb 10 '12 at 17:54
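
The stopgap suggested in the comments above (skipping every character with a value above 127) could be sketched as follows. The function name `strip_non_ascii` is made up for this example, and the code sticks to C++03 so that it should also build with Visual Studio 2008. Note that it silently drops characters, so it only makes sense as a temporary measure.

```cpp
#include <string>

// Hypothetical helper: returns a copy of the input that contains only
// 7-bit ASCII characters; every byte with a value above 127 is dropped.
std::string strip_non_ascii(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        // Plain char may be signed, in which case a byte such as 0x92 is
        // negative and "c > 127" would never be true. Compare as unsigned.
        const unsigned char uc = static_cast<unsigned char>(in[i]);
        if (uc <= 127) {
            out += in[i];
        }
    }
    return out;
}

// strip_non_ascii("DOESN\x92T") returns "DOESNT"
```

If the strings are UTF-8, this removes all bytes of a multi-byte character, since lead and continuation bytes are both above 127.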

score 1 · Answer 2 · answered Feb 10 '12 at 16:22

Since you are dealing with Unicode characters, it would be appropriate to set Character Set to Use Unicode Character Set in the project's properties.

Another possible problem could be the encoding of the source files. The best practice when working with Unicode characters is to have your source files encoded in UTF-8, especially files where you define string literals like this one. Note that UTF-8 without a BOM could be troublesome, because Visual Studio needs the BOM to interpret the file's content properly. Convert your files (I use Notepad++ for this) so that they are encoded in UTF-8 with a BOM.

LihO


I tried the same in NPP (saving as UTF-8 or UCS-2) and it doesn't help (although I used the raw `cl` without VS). I think the problem is that the Console doesn't understand the output. – Niklas B. Feb 10 '12 at 16:23
My experience is that if a program uses Unicode Character Set and doesn't display string literals correctly, it is most likely because of bad encoding of the source files. – LihO Feb 10 '12 at 16:26
Yeah, I thought the same (see my comment), but I *tried it out just now* and it's not the problem. – Niklas B. Feb 10 '12 at 16:26
