I am working on legacy code that uses MFC and the Visual C++ libraries. I need to create a vector of CString that can hold Cyrillic words, as shown in the following sample snippet:

std::vector<CString> vecContainingCyrillicWords;
vecContainingCyrillicWords.push_back(_T("English Words"));
vecContainingCyrillicWords.push_back(_T("Русские слова"));

PROBLEM: As seen in the debugging result below, the Cyrillic word is not correctly set. How do I assign a Cyrillic word to a variable of type CString?

[Image: debugger screenshot]

user2756695
  • What's the concrete type of `CString`? What does `_T` expand to? Does the code still compile when you define the preprocessor symbol `_CSTRING_DISABLE_NARROW_WIDE_CONVERSION` on the command line? – IInspectable Apr 16 '21 at 12:15
  • That can work only if either you have your OS set to a Cyrillic codepage or you build the application with `-D_UNICODE` (or both). – j6t Apr 16 '21 at 12:16
  • I don't reproduce your problem (Visual Studio 2019): https://i.imgur.com/1HXD6Vj.png but make sure the source file is saved with a unicode capable format, UTF8 with BOM for example. – Simon Mourier Apr 17 '21 at 08:34
  • Go to "project properties", set "Character Set" to Unicode. However, you may run into other problems if you are modifying an existing ANSI project. – Barmak Shemirani Apr 17 '21 at 20:24
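
The comments above hinge on which character-set symbol the build defines, because `_T` produces a wide literal only when `_UNICODE` is defined. A minimal sketch of how one might check this at compile time follows. The names `charset_mode` and `tchar_like` are hypothetical and only stand in for the real `TCHAR` machinery from `tchar.h`.

```cpp
#include <string>

// Hypothetical helper: reports which character-set symbol this build defines.
inline std::string charset_mode() {
#if defined(_UNICODE) || defined(UNICODE)
    return "Unicode";   // _T("...") expands to a wide literal, L"..."
#elif defined(_MBCS)
    return "MBCS";      // _T("...") stays a narrow literal in the ANSI code page
#else
    return "neither";   // narrow, single-byte characters
#endif
}

// Hypothetical stand-in for TCHAR: wchar_t in Unicode builds, char otherwise.
#if defined(_UNICODE)
using tchar_like = wchar_t;
#else
using tchar_like = char;
#endif
```

Printing `charset_mode()` once at startup, or adding a `static_assert` on `sizeof(tchar_like)`, is one way to confirm which kind of `CString` a legacy project is actually building.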

2 Answers

The fact that the debugger doesn't display it properly does not mean that it has not been assigned properly to the CString.

Exactly what is in the CString depends on whether you are set up for Unicode or not. Check whether the `_MBCS` symbol or the `_UNICODE` symbol is defined at compile time. If not Unicode, the characters are one byte ASCII values and when you want to display/print the characters, you will have to make sure that you use the right codepage. The debugger appears to be displaying the non-English characters incorrectly, possibly interpreting Unicode as if it were one-byte characters, or interpreting the high ASCII values of single-byte characters according to a European codepage (likely 1252) instead of the codepage supporting Cyrillic (usually 1251). The latter possibility is what it looks like to me, but I cannot be certain without knowing whether you have Unicode defined.
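
To illustrate the codepage mix-up described above, here is a small sketch of how the same byte decodes differently under Windows-1251 and Windows-1252. The function names are hypothetical, and each function covers only the byte range named in its comment; this is not a full codepage decoder.

```cpp
#include <cstdint>

// Windows-1251: bytes 0xC0..0xFF map to the Cyrillic letters U+0410..U+044F.
// Precondition (assumed by this sketch): b >= 0xC0.
constexpr char32_t decode_cp1251_letter(std::uint8_t b) {
    return static_cast<char32_t>(0x0410 + (b - 0xC0));
}

// Windows-1252: bytes 0xA0..0xFF coincide with Latin-1, so the code point
// equals the byte value. Precondition (assumed by this sketch): b >= 0xA0.
constexpr char32_t decode_cp1252_high(std::uint8_t b) {
    return static_cast<char32_t>(b);
}

// The byte 0xD0 is CYRILLIC CAPITAL LETTER ER (U+0420) in 1251,
// but LATIN CAPITAL LETTER ETH (U+00D0) in 1252.
static_assert(decode_cp1251_letter(0xD0) == 0x0420, "1251 reading of 0xD0");
static_assert(decode_cp1252_high(0xD0) == 0x00D0, "1252 reading of 0xD0");
```

The bytes in memory are identical in both cases; only the table used to display them differs, which is why a wrong-looking debugger view does not by itself prove the string was stored incorrectly.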

Andrew Truckle
Basya
  • *"If not Unicode, the characters are one byte ASCII values"* - Nope, not ASCII. Those are assumed to be **ANSI**. While Microsoft's CRT supports [3 kinds of character sets](https://learn.microsoft.com/en-us/cpp/c-runtime-library/using-generic-text-mappings) (ASCII, ANSI, Unicode), MFC's `CString` cannot distinguish between ANSI and ASCII, and assumes ANSI. – IInspectable Apr 16 '21 at 14:27
  • IIRC, ASCII including extended ASCII (that is, up to 255, not 127) and ANSI are different ways of saying the same thing. If you look up "ASCII codepage 1252" or "ANSI codepage 1252" you'll get the same thing. – Basya Apr 17 '21 at 21:07
  • No, they are not. ASCII is the character encoding that assigns meaning strictly to the numeric values 0 through 127. There are only a few functions in the CRT where the distinction between ASCII and ANSI (or codepage encoding) matters, and Microsoft's implementation acknowledges this by supporting all 3 character sets. – IInspectable Apr 17 '21 at 21:26
  • Just search for "extended ASCII" and you'll see what I mean. Also, look at one of the comments on the accepted answer here: https://stackoverflow.com/questions/700187/unicode-utf-ascii-ansi-format-differences, where he says, "The term "ANSI" when applied to Microsoft's 8-bit code pages is a misnomer. They were based on drafts submitted for ANSI standardization, but ANSI itself never standardized them." You can see extended ASCII (with some explanation) here: https://www.ascii-code.com/. Etc. – Basya Apr 18 '21 at 07:29
  • And, in any case, this is a semantic discussion that has little or no bearing on helping the OP with his problem. – Basya Apr 18 '21 at 07:30
  • I'm not claiming that "ANSI" had a well defined meaning. It does, though, in Microsoft's documentation and implementation, just like "ASCII". They are **distinct** character encodings, with different properties. The Windows API only supports 2 character encodings (ANSI and Unicode). Microsoft's CRT supports 3. The OP is using the `_T` macro, which is part of the CRT (as opposed to `TEXT`), so this distinction matters. – IInspectable Apr 18 '21 at 07:44

The problem was the file encoding. The file was already saved as UTF-8, so I expected correct results. However, I had to change the file encoding to UTF-8 with BOM; only then were the correct values pushed to the vector.

This was also suggested by @Simon Mourier in a comment above.
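
A likely explanation is that, without a byte order mark, the compiler may assume the source file is in the system's ANSI code page rather than UTF-8, so the Cyrillic literal's bytes get reinterpreted. As a rough sketch, one could check whether a file's contents start with the UTF-8 BOM (the bytes EF BB BF) like this. The helper name `has_utf8_bom` is hypothetical, and reading the file into the string is left out.

```cpp
#include <string>

// Returns true when the buffer begins with the UTF-8 byte order mark EF BB BF.
inline bool has_utf8_bom(const std::string& bytes) {
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}
```

For example, `has_utf8_bom(std::string("\xEF\xBB\xBF" "text"))` yields `true`, while a plain `"text"` buffer yields `false`. The string literal is split in two so that the hex escape does not absorb the following characters.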

user2756695
  • You should mention who gave you that idea. – Simon Mourier Apr 20 '21 at 07:28
  • @SimonMourier: Actually I was just trying different encodings myself. But now I see that you had also suggested this idea in one of your comments so, I have updated my answer and mentioned you :) – user2756695 May 06 '21 at 11:28