Swedish characters don't compare correctly

Question

For some reason if/else statements aren't working correctly for me in C++.

The problem is that when a variable is equal to the Swedish word for right (höger), the program doesn't execute the if branch; it goes on to the else branch instead. If I replace the letter 'ö' with, say, 'o', so the word becomes 'hoger', the if statement works. So whenever I type 'höger' the comparison fails, but if I make the literal 'hoger' and then type 'hoger', it works. How can I type 'höger' and have the if statement recognize it? It's as if Swedish letters don't work.

My code looks like this:

#include <iostream>
#include <string>

using namespace std;


int main() {
    setlocale(LC_ALL,"");


    string test; // Define variable
    cout << " Höger elle vänster"<<endl; // Right or left
    cin >> test;


    if(test == "höger") { // If right, then output this.

        cout <<"Du valde höger"<<endl;

    } 

    else if(test == "vänster") { // If left, then output this

        cout <<"Du valde vänster"<<endl;

    } else {

        // Do this

    }


}

What's the encoding of your source file? What's the encoding of your terminal? — John Zwinck, Apr 06 '14 at 11:16
what do you get trying std::cout << std::locale("").name(); in this program? — 4pie0, Apr 06 '14 at 11:19
Oh, and your logic is flawed. The input not being `"höger"` doesn't automatically mean it's `"vänster"`. What if the user inputs something else? — Some programmer dude, Apr 06 '14 at 11:20
this is just the test most probably for testing strange "o"... — 4pie0, Apr 06 '14 at 11:21
Btw I am curious, how can we use setlocale function without including the locale header? — Veritas, Apr 06 '14 at 11:34
[Kompilatorn gillar inte svenska](https://translate.google.com/#sv/en/kompilatorn%20gillar%20inte%20svenska) ;) — Emmet, Apr 06 '14 at 11:48
@Veritas: because C++, unlike C, does not offer any guarantees about which headers the system headers in turn include. the code is not portable, though. — Cheers and hth. - Alf, Apr 06 '14 at 12:07

david.pfx · Answer 1 · 2014-04-07T00:41:17.057

The problem is almost certainly to do with encodings.

The C/C++ language specs do not automatically handle anything other than 7 bit ASCII. The o-umlaut character is outside that range, and the exact behaviour depends on the encoding of your source code file.

The most likely possibilities are ISO 8859-1, Windows ANSI-1252, UTF-8 or Windows OEM 850. The first two encode this character the same, but in each of the others it is different.

With a bit more information about the encoding and tool set you are using it may be possible to provide more specific diagnosis and advice.

[And by the way, if/else statements in C/C++ work just fine, thank you.]

If we assume for the moment that this is Windows and Visual C++, then this is what you're dealing with.

Source code written inside Visual Studio: code page 1252. Code point for the o-umlaut character is 0xf6.
Keyboard input read from the console: code page 850. Code point for the o-umlaut character is 0x94.

Obviously not a good match. However, Visual Studio can also quite happily edit source code files in many encodings, including UTF-8 (with byte order mark), UTF-16 (wide characters) and code page 850. So:

Source code written inside Visual Studio: code page 850. Code point for the o-umlaut character is 0x94. Now it works.

You can also change the code page for your console using the CHCP command.

Change Console to CHCP 1252 and it works.
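The mismatch can be seen at the byte level. The following is a minimal sketch, not part of the original code: the function names are hypothetical, and the bytes are written as explicit escapes so that the result does not depend on how this snippet itself is saved.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// "höger" as the bytes a Windows-1252 source literal produces (ö = 0xF6).
// The literal is split after the escape so the hex escape cannot absorb
// any following characters.
std::string hoger_cp1252() { return std::string("h\xF6" "ger"); }

// "höger" as the bytes a code page 850 console hands to std::cin (ö = 0x94).
std::string hoger_cp850() { return std::string("h\x94" "ger"); }

// Render each byte as two lowercase hex digits, separated by spaces.
std::string hex_dump(const std::string& s) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned value = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Comparing `hoger_cp1252()` with `hoger_cp850()` using `==` yields false, because `std::string` compares bytes and the second byte differs (0xF6 versus 0x94). Dumping the contents of `test` with `hex_dump` right after `cin >> test;` is a quick way to see which encoding the console actually delivered.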

The behaviour of the compiler when reading source code is obliged by the standard to be consistent with the execution character set. See n3797 S2.2.5:

Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set

S2.3/3:

The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.

n3797 S2.14.3/1:

A character literal that does not begin with u, U, or L is an ordinary character literal, also referred to as a narrow-character literal. An ordinary character literal that contains a single c-char representable in the execution character set has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set.

n3797 S2.14.5/6:

a string literal that does not begin with an encoding-prefix is an ordinary string literal, and is initialized with the given characters.

The execution character set is implementation-defined. Microsoft's statement regarding implementation-defined behaviour for the C compiler is here: http://msdn.microsoft.com/en-us/library/hx3yt8af.aspx. [I can't find a separate one for C++, so I assume this applies to both.]

The source character set is the set of legal characters that can appear in source files. For Microsoft C, the source character set is the standard ASCII character set.

Sorry about the language-lawyer stuff, but what this says is that the MSVC compiler is independent of locale/encoding and implements 8-bit ASCII, code page unspecified. Obviously the standard library functions may need to know the encoding for various purposes, but that is a whole other story.

As a final point, the Microsoft C compiler dates back around 30 years, to before Windows. It has always been possible to write source code in code page 850 and have it run correctly on the console, subject to careful handling of extended (8-bit) characters. Many people still do. The problem here is source code written in Windows-Ansi or Unicode combined with keyboard input from an OEM (cp850) console. Change either one to get it to work correctly.

Strongly misleading: "the exact behaviour depends on the encoding of your source code file". That is only the case with an incorrect compiler invocation. With Visual C++ the only practical way is to use UTF-8 encoding sans BOM, which is pretty unusual in Windows (many programs, not just the VC compiler, will misinterpret such a file). However, with g++ *the default* is to not check the narrow encoding, and the default g++ narrow execution character set is UTF_8, and so with e.g. MinGW g++ in Windows one can indeed get problems due to the source file encoding. g++ fix: specify the encoding. ;-) — Cheers and hth. - Alf, Apr 06 '14 at 12:53
-1 re "Source code written inside Visual Studio: code page 850. Code point for the o-umlaut character is 0x94. Now it works." is extremely ungood advice. It means that e.g. sorting and character classification will yield **incorrect** results, since the compiler assumes that the source code is encoded as Windows ANSI. Visual C++ determines the source encoding from the file contents, and recognizes UTF-8 with BOM as well as UCS2/UTF-16 -- but it has no way to recognize that it's being lied to about the source encoding. — Cheers and hth. - Alf, Apr 06 '14 at 16:45
@Cheersandhth.-Alf: Wrong. The compiler works just fine and makes no such assumptions. Library functions do the sorting and classification, and they need to know the correct locale/encoding to work correctly. — david.pfx, Apr 06 '14 at 22:59
Try to add a `setlocale( LC_ALL, "" )` at the start of `main`. That makes the output incorrect in a codepage 850 console. Never mind that sorting and classification also is incorrect, just note that even the output itself is wrong -- as are, of course, you. Wrong. Wrong. And ... wrong. Just completely and utterly wrong. Wrong. — Cheers and hth. - Alf, Apr 06 '14 at 23:23

Cheers and hth. - Alf · Answer 2 · 2014-04-06T19:52:12.113

In practice this problem will only manifest itself in Windows, so I'll assume Windows.

Then the problem is that the C++ narrow extended execution character set⁽¹⁾ (encoding) does not match the encoding used by the console window. "Narrow" refers to the char type. "Execution character set" is a formal term employed by the C++ standard, and refers to the encoding that is assumed for text stored in the executable. The compiler translates source code literals to this encoding. It's also assumed for translation to/from any external encoding, such as translation to/from a console's encoding.

With Visual C++ the narrow encoding is always Windows ANSI⁽²⁾, regardless of source code encoding, unless you trick the compiler. And assuming you're using Visual C++, this is then one encoding that you know.
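One way to check which narrow encoding a given compiler actually uses is to look at the bytes it stores for a literal. The sketch below is only an illustration with a hypothetical function name. It uses the universal-character-name `\u00F6` for "ö", so the source file itself stays pure ASCII and its encoding does not matter.

```cpp
#include <cstdio>
#include <string>

// Returns the bytes the compiler stored for "ö" in a narrow string literal,
// as space-separated lowercase hex digits.
std::string narrow_bytes_of_o_umlaut() {
    const char text[] = "\u00F6";  // universal-character-name for ö
    std::string out;
    char buf[4];
    for (const char* p = text; *p != '\0'; ++p) {
        const unsigned value = static_cast<unsigned char>(*p);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

With a Windows-1252 narrow execution character set the result should be `f6`, and with a UTF-8 one (the g++ default) `c3 b6`. Treat these values as expectations to verify on your own toolchain, not as guarantees.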

The encoding in the console window is by default the one used for the original IBM PC, in your case probably codepage 850 (a Western European variant of the original IBM PC English codepage 437). Run the Windows command interpreter cmd (Windows-key+R, type cmd, OK). Type chcp to check the current codepage. Type chcp 1252 to switch to Windows ANSI Western, which presumably is the Windows ANSI codepage on your machine. Run your program [.exe] file, e.g. by typing its full path, or by going to its directory and typing just its name, e.g.

[H:\dev\test\0046]
> cl /nologo /EHsc /GR encoding.cpp /Fe:b.exe
encoding.cpp

[H:\dev\test\0046]
> chcp & b
Active code page: 850
 Höger elle vänster
höger
                             ← No output here, didn't compare as equal.
[H:\dev\test\0046]
> chcp 1252
Active code page: 1252

[H:\dev\test\0046]
> b
 Höger elle vänster
höger
Du valde höger

[H:\dev\test\0046]
> _

… where cl (short for original “Lattice C”) is the Visual C++ compiler.
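The same switch can in principle be made from inside the program instead of typing chcp. The following is a hedged sketch, assuming a Windows build: `GetACP`, `GetConsoleCP`, `GetConsoleOutputCP`, `SetConsoleCP` and `SetConsoleOutputCP` are Win32 API functions, while the helper name is hypothetical and not from the original code.

```cpp
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#endif

// Tries to make the console's input and output code pages equal to the
// Windows ANSI code page. Returns true only if both calls succeed.
bool match_console_to_ansi_codepage() {
#ifdef _WIN32
    const UINT ansi = GetACP();
    std::printf("ANSI: %u, console input: %u, console output: %u\n",
                ansi, GetConsoleCP(), GetConsoleOutputCP());
    return SetConsoleCP(ansi) != 0 && SetConsoleOutputCP(ansi) != 0;
#else
    return false;  // Not Windows: there is no console code page to change.
#endif
}
```

Calling this at the start of `main`, before reading from `cin`, is intended to have the same effect as running chcp 1252 first on a machine whose ANSI codepage is 1252. The font caveat described below applies here as well.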

You can change the console codepage more permanently by running regedit, going to this registry key:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage

and in the list in the right pane double-click the value named OEMCP (short for Original Equipment Manufacturer Code Page, referring to the IBM PC), change it to 1252, or more generally to the same value as the ACP value, and reboot the machine.

Oh, it's also necessary to change the console window font to a TrueType font such as Lucida Console, because the default is (an emulation of) a bitmapped font that only works correctly with the original console codepage. You can right click the console window title to get a menu, choose [Defaults], and configure the default font, size, colors etc. The changes won't affect the current console window, but they will apply to future console windows, except for those that have been configured individually⁽³⁾.

An alternative to such console window configuration is to use the Console2 program. If you do, then in Windows 7 and later be sure to use the 64-bit version. Otherwise some things, such as invoking links to 64-bit programs, won't work.

Summing up, you can either

run the program from the command interpreter (using chcp to change the codepage), or
change the console codepage more permanently, as discussed above.

In either case it's a Good Idea™ to change the console window font to a TrueType font – and yes, this affects the functionality, not just the looks.

Note on additional Microsoft absurdity: in Windows 7 and later the "System" font used by default in console windows is actually, behind the scenes, a TrueType font with umpteen thousand glyphs, but it's used to emulate the old 16-bit Windows bitmapped fonts, with the same silly restrictions, so that you still have to change to some other TrueType font…

⁽¹⁾ See the C++11 standard §2.3/3.

⁽²⁾ “Windows ANSI” depends on the Windows configuration and is always the codepage specified by the GetACP API function. In practice this function gets its value from the registry key/value referenced above. However, that's largely undocumented.

⁽³⁾ In Windows XP Windows would ask if you wanted to save an individual console window configuration. Starting with Windows Vista it's saved with no question asked and no information that it's been saved. There is no user interface for removing such saved configurations, but they can be removed by programmatically altering shortcut files, and/or by registry editing, which however is both an impractical and brittle solution.

Sorry for all the editing, but this is a ridiculously complex issue. I wish Microsoft could get their act together. Alas, it's apparently non-technical people in charge. — Cheers and hth. - Alf, Apr 06 '14 at 12:06
`With Visual C++ the narrow encoding is always Windows ANSI(2), regardless of source code encoding` -- not so. VC++ happily compiles code page 850, as well as UTF-8 and UTF-16. What goes in, comes out. — david.pfx, Apr 06 '14 at 14:54
@david.pfx: re "not so" and "what goes in, comes out". you're wrong. why do you post such **false claims** without even bothering to check. i flagged your comment as "not constructive". for it's absolutely not constructive to spread disinformation. — Cheers and hth. - Alf, Apr 06 '14 at 15:18
You said it, so prove it. Where does it say MSVC++ is Windows-Ansi only? What test would show that it is so? — david.pfx, Apr 06 '14 at 22:54
It's a cute little diagram, but did you test input written in code page 850? I did. — david.pfx, Apr 06 '14 at 22:55
For other readers, david.pfx is learning, but writes (misleadingly) as if he's knowledgeable. @david.pfx: by encoding a Visual C++ source file in code page 850 and using non-ASCII letters, you're simply lying to the compiler. To see the general effect of that, namely garbled output, use `setlocale( LC_ALL, "" )` at start of `main`, as in the OP's code. This (undocumented, I think) causes Visual C++'s runtime library to employ direct console i/o when it detects that the output goes to the console. With the assumption that the strings in the executable are Windows ANSI encoded. Bang, ouch foot. — Cheers and hth. - Alf, Apr 06 '14 at 23:00
@david.pfx: "Where does it say MSVC++ is Windows-Ansi only", that's a bit difficult, since it doesn't even say that it is ANSI. It's effectively undocumented, as much else. For Visual C++ 8.0, I think it was, there was a patch that changed the narrow execution character set to UTF-8, but as far as I know that patch is not available for later versions. However, note that **wide character text**, the wide character execution set, is UTF-16. So that's the general solution for international text literals with Visual C++. — Cheers and hth. - Alf, Apr 06 '14 at 23:09
@david.pfx: re "what test would show that [the narrow execution character set is Windows ANSI]", the easiest is perhaps to use some non-English letters in a narrow string literal in UTF-8-with-BOM encoded source. Then check the resulting byte values in the string. The point is that with the BOM you *tell* the compiler that the source is UTF-encoded, and knowing that, it translates to Windows ANSI. For more information about execution character sets etc., note that this (except the specific encoding) is *specified by the C++ standard*. And the answer provides a reference to the particular para. — Cheers and hth. - Alf, Apr 06 '14 at 23:18
I find David's insinuation of vengeance voting (coinciding with a downvote of this answer), etc., etc., distasteful. Still, also for David, I just remembered a detail that might help with the undocumented nature of Visual C++'s narrow execution character set. Namely, the warning `international_hello.cpp(7) : warning C4566: character represented by universal-character-name '\uXXXX' cannot be represented in the current code page (1252)`, when compiling UTF-8 encoded source. In clearspeech it says the source text could not be converted to Windows ANSI. And yes, it uses the `GetACP` codepage. — Cheers and hth. - Alf, Apr 06 '14 at 23:56

score 0 · Answer 3 · answered Apr 06 '14 at 14:23

The only change I made to your code was the following:

// setlocale(LC_ALL, "");
char *l = setlocale(LC_ALL, NULL);
cout << "Current Locale: " << l << endl;
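For reference, here is a small self-contained sketch of the same query. The helper names are hypothetical. It relies on two standard behaviours of `std::setlocale` from `<clocale>`: a null pointer means "query only", and an empty string selects the user's default locale.

```cpp
#include <clocale>
#include <string>

// Query the current C locale without changing it.
std::string current_locale_name() {
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? std::string(name) : std::string();
}

// Switch to the user's default locale, as the original setlocale(LC_ALL, "")
// call does. Returns the new locale name, or an empty string on failure.
std::string use_user_locale() {
    const char* name = std::setlocale(LC_ALL, "");
    return name ? std::string(name) : std::string();
}
```

Before any call that changes the locale, `current_locale_name()` returns "C", which is why the modified program above reports the "C" locale and does not show the effect of the commented-out call.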

Because I don't have an “ISO” keyboard layout, I used Alt codes to type the character I need. The following are the key combinations I used for the different code pages.

First run I had to type in Alt+246 for Code page 437
Second run, Alt+148 for Windows-1252

Below is the output when I change the code page between executions. (Screenshot: output of program)

Black Frog

Note that the first example, with Alt 246 for codepage 437, is **not** the equivalent of typing an "ö" on a Scandinavian keyboard with that codepage active. On a Scandinavian PC, with a codepage 850 (emulated) system font, typing an "ö" displays as "ö", while character 246 in these codepages is an "÷", as shown in your screenshot. Also, note that `setlocale` is necessary in general for correct (narrow text) alphabetical ordering and character classification. – Cheers and hth. - Alf Apr 06 '14 at 16:11

score 0 · Answer 4 · answered Jan 15 '16 at 14:06

It seems the problem is the encoding of your source file when your IDE compiles it. If you are using Visual Studio you can change your encoding setting like this:

BrandonFlynn-NB

Please don't reply with screenshots; rather give more precise information. – Thomas Baruchel Jan 15 '16 at 14:23
