2

I'm trying to write strings with non-ASCII characters in them to a file, such as "maçã", "pé", and so on.

I'm currently doing something like this:

// _setmode and _fileno need <io.h> and <fcntl.h>; the streams and containers
// below need <iostream>, <fstream>, <string> and <vector>.
#include <io.h>
#include <fcntl.h>
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
using namespace std;

_setmode(_fileno(stdout), _O_U16TEXT);

//I added the _setmode line to the question only recently, but it was in the
//code all along; I simply forgot to copy it, along with the headers I
//included to be able to call it.


wstring word = L"";
wstring file = L"example_file.txt";   // missing semicolon added
vector<wstring> my_vector;

wofstream my_output(file);

while (word != L".")
{
    getline(wcin, word);
    if (word != L".")
        my_vector.push_back(word);    // was "pushback", which doesn't compile
}

for (std::vector<wstring>::iterator j = my_vector.begin(); j != my_vector.end(); j++)
{
    // element pointed to by the iterator, going through the whole vector
    my_output << *j << endl;

    my_output << L"maçã pé" << endl;  // was L("maçã pé"); L"..." is the wide-literal syntax
}
my_output.close();

Now, if I enter "maçã", "pé" and "." as words (only the first two are stored in the vector), the output to the file is rather strange:

  • the words I entered (stored in variables) come out garbled: "ma‡Æ" and "p,";
  • the words stored directly in the code appear perfectly normal: "maçã pé".

I have tried using `wcin >> word` instead of `getline(wcin, word)`, and writing to the console instead of a file; the results are the same: strings stored in variables come out wrong, while strings written directly in the code come out perfectly.

I cannot find a reason for this to happen, so any help will be greatly appreciated.

Edit: I am working in Windows 7, using Visual C++ 2010

Edit 2: added one more line of code that I had missed (right at the beginning).

EDIT 3: following SigTerm's suggestion, I realised the problem is with the input: neither `wcin` nor `getline` gets the string into the `wstring` variable `word` with the right characters. So, the question is: do you know what is causing this or how to fix it?

Sampaio
  • What operating system and compiler? – Alan Stokes Sep 28 '13 at 18:01
  • @AlanStokes Question updated. – Sampaio Sep 29 '13 at 00:09
  • Possible duplicate: http://stackoverflow.com/q/17808673/2230 (Not trying to self-promote :-) ) – Euro Micelli Sep 29 '13 at 00:10
  • @EuroMicelli Nope, not a duplicate. My unusual characters appear when I write them in my code, unlike yours. My problem is only when trying to output variable-stored strings, not hard-coded ones. – Sampaio Sep 29 '13 at 00:39
  • Oh, I see. I think this is the same problem, but backwards. I'll write it up below. – Euro Micelli Sep 29 '13 at 04:02
  • Maybe CIN encoding is incorrect/incorrectly handled? Run code through debugger and make sure words are READ in correct encoding. Put breakpoint within `while(word != L".")`, investigate `word` after reading it. Visual studio should be able to display contents of wstrings. – SigTerm Sep 29 '13 at 17:52
  • @SigTerm Wow, I cannot believe I didn't think of that. Your hunch was correct: `word` is not getting the right characters, which makes writing them correctly quite a bit harder. Any idea on what is causing this and/or how to solve it? – Sampaio Oct 01 '13 at 20:13
  • @Sampaio: I don't know a platform-independent way to determine/change cin encoding (it might exist, though). If I were you, I'd try the same _setmode trick on CIN (see the sketch after these comments). According to [msdn](http://msdn.microsoft.com/en-us/library/tw4k6df8.aspx), it might work. WinAPI has [plenty of console-related functions](http://msdn.microsoft.com/en-us/library/windows/desktop/ms682073(v=vs.85).aspx) but that's too platform-specific to my liking. It would probably be easier to just write output to a file instead of dealing with the terminal. – SigTerm Oct 01 '13 at 22:35
  • @Sampaio, `wcin` is interpreting bytes read from the console as cp1252, but the console is sending bytes as [cp437](http://en.wikipedia.org/wiki/Code_page_437). So when the console sends byte 87h (cp437 for `ç`), wcin uses the [cp1252](http://en.wikipedia.org/wiki/Windows-1252) table to convert it to Unicode character `‡` (87h in cp1252). My answer uses the `imbue` method to set the wcin stream to be interpreted as cp437 and will read the character correctly. Unfortunately, the `ã` character cannot be represented in cp437, so you'll have to switch the console code page to cp1252 to send it. – Mark Tolonen Oct 02 '13 at 03:23
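As a hedged illustration of SigTerm's suggestion (not something verified on the asker's machine), the same `_setmode` trick applied to stdin as well as stdout would look roughly like this; `_O_U16TEXT` and the header names are the Windows/MSVC-specific pieces:

#include <fcntl.h>
#include <io.h>
#include <iostream>
#include <string>

int main()
{
    // Put both standard input and output into UTF-16 mode so that
    // wcin/wcout exchange wide characters with the console directly.
    _setmode(_fileno(stdin),  _O_U16TEXT);
    _setmode(_fileno(stdout), _O_U16TEXT);

    std::wstring word;
    std::getline(std::wcin, word);     // should now receive "maçã" as typed
    std::wcout << word << std::endl;   // and echo it back unchanged
    return 0;
}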

3 Answers

3

Try to include

#include <locale>

and at the beginning of main, write

std::locale::global(std::locale(""));
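A minimal sketch of how that might fit together with the streams from the question (file name reused from the question; whether this fixes the console input still depends on the environment's code page):

#include <locale>
#include <iostream>
#include <fstream>
#include <string>

int main()
{
    // Replace the default "C" locale with the user's environment locale.
    std::locale::global(std::locale(""));

    // wcin/wcout were constructed before this point, so imbue them
    // explicitly; the file stream below is created afterwards and
    // already defaults to the new global locale.
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());

    std::wofstream my_output("example_file.txt");

    std::wstring word;
    std::getline(std::wcin, word);
    my_output << word << std::endl;
    return 0;
}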
tomi.lee.jones
  • This could be the answer; by default the program is in the "C" locale and won't handle non-ASCII characters properly. This command uses the locale from the environment, so it is more likely to work. – Gavin Smith Sep 28 '13 at 18:28
  • Unfortunately I can't speak for other compilers than VS. This is how I solved the issue. Would be nice to hear which one OP is using. – tomi.lee.jones Sep 28 '13 at 19:58
  • Question edited with work environment information. I tried your solution, and even though the program compiled and ran, the end result was exactly the same. – Sampaio Sep 29 '13 at 00:07
  • Which compiler are you using? – tomi.lee.jones Sep 29 '13 at 00:27
  • http://stackoverflow.com/questions/3950718/wrote-to-a-file-using-stdwofstream-the-file-remained-empty Here may be an answer to your problem. Try the second answer, the one that suggests using `codecvt_utf8`. – tomi.lee.jones Sep 29 '13 at 01:02
  • That question seems different from mine, since the OP of that question can't output to a file a hard-coded string, which I can. However, I did try that solution, the one with `codecvt_utf8`, but I got the exact same result as with my original code – Sampaio Sep 29 '13 at 13:49
1

Windows makes encodings confusing because the console typically uses an "OEM" code page, while GUI applications use an "ANSI" code page. Each varies with the localized version of Windows in use. On U.S. Windows, the OEM code page is 437 and the ANSI code page is 1252.

Keeping the above in mind, setting the streams to the locale being used fixes the problem. If working in the console, use the console's code page:

wcin.imbue(std::locale("English_United States.437"));
wcout.imbue(std::locale("English_United States.437"));

But keep in mind that most code pages are single-byte encodings, so each can represent only 256 of the possible Unicode characters:

wstring word;
wcin.imbue(std::locale("English_United States.437"));
wcout.imbue(std::locale("English_United States.437"));
getline(wcin, word);
wcout << word << endl;
wcout << L"maçã pé" << endl;

This returns on the console:

maça pé
maça pé

Code page 437 doesn't contain ã.

You can use code page 1252 from the console (a sketch follows the list) if you:

  • Issue chcp 1252.
  • Use a TrueType console font like Consolas or Lucida Console.
  • Imbue the streams with English_United States.1252 instead.
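Roughly, and assuming a U.S. install (the asker eventually reported that `system("chcp 1252")` did the trick), an in-code equivalent might look like this; `SetConsoleCP`/`SetConsoleOutputCP` stand in for the manual `chcp`, and the console font still has to be switched by hand:

#include <windows.h>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    // Switch the console itself to code page 1252 (same effect as "chcp 1252").
    SetConsoleCP(1252);
    SetConsoleOutputCP(1252);

    // Tell the wide streams that console bytes are now cp1252.
    std::wcin.imbue(std::locale("English_United States.1252"));
    std::wcout.imbue(std::locale("English_United States.1252"));

    std::wstring word;
    std::getline(std::wcin, word);
    std::wcout << word << std::endl;   // "maçã pé" should now round-trip
    return 0;
}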

Writing to a file has similar issues. If you view the file in Notepad, it uses the ANSI code page to interpret the bytes in the file. So even if a console app is using code page 437, Notepad will display the file incorrectly if it was written using the 437 code page. Writing the file in code page 1252 doesn't help either, because the two code pages don't interpret the same set of Unicode code points. Some answers to this problem are to get a different file viewer, such as Notepad++, or to write the file in UTF-8, which supports all Unicode characters.
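For the UTF-8 route, one possible sketch (using the `codecvt_utf8` facet also mentioned in the comments above; this addresses how the file is written and viewed, not the console-input problem):

#include <fstream>
#include <locale>
#include <codecvt>
#include <string>

int main()
{
    std::wofstream my_output("example_file.txt");

    // Imbue the file stream with a UTF-8 conversion facet so that any
    // Unicode character in a wstring can be written; Notepad and most
    // editors can then display the file regardless of the console code page.
    my_output.imbue(std::locale(my_output.getloc(),
                                new std::codecvt_utf8<wchar_t>));

    my_output << L"maçã pé" << std::endl;
    return 0;
}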

Mark Tolonen
  • I have just seen your answer, and have to read it more in-depth, but here's the thing about the last part: Notepad is correctly displaying all the characters that need to be displayed. However, it only displays them well when they're written in a certain way, which is the truly strange thing, and not a small part of the question. – Sampaio Oct 01 '13 at 20:07
  • Notepad can display all characters of Unicode, but can only write to a file the characters used by the encoding specified during save. The problem with your code is reading from a console that uses a different encoding (OEM) than the software is defaulting to (ANSI), so the characters are being interpreted incorrectly. – Mark Tolonen Oct 02 '13 at 03:09
  • I really need characters such as `ã`. That being said, `system("chcp 1252")` solved the problem, words and file are written and read perfectly. The console font does need to be changed to view the words properly **in the console**, but, other than that, everything seems great. Thank you. – Sampaio Oct 02 '13 at 15:59
0

You are having the opposite of the problem described here.

The core reason is the same: characters in the "ASCII"¹ range 128-255 are less standardized than the characters in the range 32-127. Most Windows applications, whether they use "Unicode" or "ANSI" strings, use the same mapping between codes and characters as specified by Unicode. However, for mostly historical reasons, the console uses a separate map of codes-to-characters usually called the "codepage". The exact table used depends on the language and configuration of Windows. For US English computers, that's the OEM 437 Code Page.

When you type ç in the console, you are really entering character code 135, because that's the code assigned to that character in the 437 code page used by the console. The rest of Windows interprets that character code, as described in the Windows-1252/Unicode tables, as the character ‡.
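To see the two interpretations side by side, here is a small illustrative sketch (assumed, not from the answer) that decodes that same byte, 87h, under both code pages with `MultiByteToWideChar`:

#include <windows.h>
#include <iostream>

int main()
{
    const char oem[] = "\x87";          // the byte the console sends when you type ç
    wchar_t as437[2] = {0}, as1252[2] = {0};

    MultiByteToWideChar(437, 0, oem, -1, as437, 2);    // console's view: ç
    MultiByteToWideChar(1252, 0, oem, -1, as1252, 2);  // ANSI/Unicode view: ‡

    std::wcout << as437 << L" vs " << as1252 << std::endl;
    return 0;
}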

You can use OemToChar (documentation here) to convert text entered via the console to the corresponding string in Unicode encoding.
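A hedged sketch of what that could look like: read the raw OEM bytes with the narrow `cin`, then let `OemToCharW` translate them to a wide string (buffer sizing and error handling kept minimal):

#include <windows.h>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    // Read the raw (OEM code page) bytes from the console.
    std::string oem_line;
    std::getline(std::cin, oem_line);

    // Translate the OEM bytes to the corresponding Unicode characters.
    std::vector<wchar_t> buffer(oem_line.size() + 1);
    OemToCharW(oem_line.c_str(), &buffer[0]);

    std::wstring word(&buffer[0]);       // now holds the text as typed
    std::wcout << word << std::endl;
    return 0;
}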

See my answer here for other background information.


¹ Yes, this range is technically not ASCII, but close enough. I'm also using the usual informal (and technically wrong) definition of Unicode throughout.

Euro Micelli
  • I have read both your answers, and I noticed the explanation relies mostly on the *console's* code page. Just to make sure, my problem also happens when writing to a file. Also, the console displays the "ç" character just fine, as long as the character isn't stored in a variable: *wcout << L"ç";* outputs ç, nice and clean. I will check out OemToChar, see if that works. – Sampaio Sep 29 '13 at 13:18
  • @Sampaio, the console uses CP437, while VS uses 'Unicode' (32-bit strings) or 'Windows 1252' (16-bit). Your text editor is also using Windows-1252 or Unicode, so the file "matches" what you see in VS. Run `TYPE example_file.txt` from the console and you'll see that there, what you typed in the console looks "right" while the literals from code look "wrong". The file is just a bunch of bytes; it's the program that gives it textual meaning when it chooses a codepage to map the bytes to text. Also, try an editor like Textpad that gives the option to read a file as "DOS" to see both encodings. – Euro Micelli Sep 29 '13 at 14:19
  • @Sampaio: incidentally, neither is "wrong". They are just "different" and the 'Unicode/Windows-1252' one is far more common these days. You just need to be aware of the difference, and know which one you want. In your case, you want to see the Unicode version, so you need to convert text that you receive if it comes in other encodings. – Euro Micelli Sep 29 '13 at 14:24
  • Oh, you say `wcout << L"ç";` works, eh? I wouldn't have expected that. Weird. I'll check it out when I get to my office tomorrow. Please let me know how it goes. – Euro Micelli Sep 29 '13 at 14:28