6

In Visual Studio C++ 2013 Express, it seems that unless a UTF-8-encoded file has a BOM, the compiler fails to recognize that the file being compiled is in UTF-8 and treats it as being in the native encoding. The code editor, however, does not have this problem.

warning C4819: The file contains a character that cannot be represented in the current code page (932). Save the file in Unicode format to prevent data loss

Is there a fix for this behavior? I remember this being a common problem in all Visual Studio versions, but I don't remember ever seeing a fix. I can't exactly keep adding BOM marks to every file that is not mine, especially if the source is maintained in a code repository.
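
For reference, here is a minimal file that reproduces the warning on my machine (the file name and contents are just an illustration; any non-ASCII character, even inside a comment, triggers it once the file is saved as UTF-8 without a BOM):

    // minimal.cpp -- saved as UTF-8 *without* a BOM (hypothetical repro)
    // The comment below contains non-ASCII characters; that alone is enough
    // to trigger C4819 when the system code page is 932 (Shift-JIS).
    // héllo / こんにちは
    #include <iostream>

    int main() {
        // The warning fires even though no string literal uses those characters.
        std::cout << "plain ASCII output\n";
        return 0;
    }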

SigTerm
  • 26,089
  • 6
  • 66
  • 115
  • @sjdowling That isn't very helpful. Also, you're several years too late with welcome. – SigTerm Dec 18 '14 at 11:06
  • @HansPassant: I'm not Japanese (I can speak and read it a BIT, though), but I'm using that locale on my machine. – SigTerm Dec 18 '14 at 12:28
  • @HansPassant: My machine has the Japanese locale set: English text, Shift-JIS as the default code page. The text editor in Visual Studio correctly opens every file and identifies the encoding as UTF-8. However, the moment I start compiling, I get dozens of those warnings, because the compiler, unlike the text editor, can't identify UTF-8 with no BOM. The origin of the file doesn't matter, because ANY file with a non-ASCII symbol will trigger that warning. If I'm unlucky I get a compile error instead (UTF-8 misrepresented as Shift-JIS can eat " symbols, meaning "newline in constant" and lots of other fun things). – SigTerm Dec 18 '14 at 12:35
  • Now I've successfully confused myself. The message suggests that the compiler is trying to cram a string into Shift-JIS. However, in my experience what it actually does is try to interpret the file as being in Shift-JIS despite it being UTF-8, because you get this warning even if the offending character is inside a comment block. :-\ Need to think this one over. – SigTerm Dec 18 '14 at 12:40
  • @HansPassant: This is compiler warning C4819 (level 1), as documented on MSDN. Not a text editor warning. – SigTerm Dec 18 '14 at 12:51
  • Sorry, those comments were misleading. – Hans Passant Dec 18 '14 at 12:56
  • Not exactly a solution, but I've found that saving files with a UTF-8 BOM works well with other tools (VSCode, Git, etc.). These warnings shouldn't be ignored: I've had compile errors because of encoding issues, and switching from UTF-8 to UTF-8 with BOM fixed them. – Gajo Petrovic Nov 13 '18 at 05:10

4 Answers

2

Update to Visual Studio 2015. It supports new compiler options for the source and execution character sets.

You can use the /utf-8 option to specify both the source and execution character sets as encoded by using UTF-8. It is equivalent to specifying /source-charset:utf-8 /execution-charset:utf-8 on the command line. Any of these options also enables the /validate-charset option by default....

By default, Visual Studio detects a byte-order mark to determine if the source file is in an encoded Unicode format, for example, UTF-16 or UTF-8. If no byte-order mark is found, it assumes the source file is encoded using the current user code page, unless you have specified a code page by using /utf-8 or the /source-charset option. Visual Studio allows you to save your C++ source code by using any of several character encodings....

Ref: https://msdn.microsoft.com/en-us/library/mt708821.aspx
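
For illustration, here is a minimal sketch of how that plays out (the file name and command lines are just examples; the cl invocations assume a VS 2015 developer command prompt):

    // utf8_demo.cpp -- saved as UTF-8 *without* a BOM
    //
    // Build (from a VS 2015 developer command prompt):
    //   cl /utf-8 utf8_demo.cpp
    // which is equivalent to:
    //   cl /source-charset:utf-8 /execution-charset:utf-8 utf8_demo.cpp
    #include <iostream>

    int main() {
        // Without /utf-8 (or /source-charset:utf-8) the compiler decodes this
        // file using the current code page, which can trigger C4819 or garble
        // the literal below.
        const char* greeting = u8"héllo, こんにちは";
        std::cout << greeting << '\n';
        return 0;
    }

In the IDE, the same switch can be added under Project Properties > C/C++ > Command Line > Additional Options.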

Mark Tolonen
  • 166,664
  • 26
  • 169
  • 251
0

How should the compiler guess which encoding you intend the file to be interpreted with? That said, there are a few options:

  • Why not simply add a BOM? It's a pretty good way to mark a UTF-8 file, as it is very unlikely to be misinterpreted (see the sketch after this list for one way to do it in bulk).
  • Other than that, I believe you can use a #pragma to tell MSC about the encoding, although I have never used this myself.
  • Lastly, not using anything outside the basic character set is also an option; it is slightly outdated in our world, but it still works reliably.
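
If you go the BOM route and have many files, a small helper can prepend the marker for you. This is only a sketch (the file name add_bom.cpp and the usage are my own illustration); it skips files that already start with a BOM:

    // add_bom.cpp -- prepend a UTF-8 BOM (EF BB BF) to a file if it lacks one.
    // Usage: add_bom <file>
    #include <algorithm>
    #include <cstdio>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <vector>

    int main(int argc, char* argv[]) {
        if (argc != 2) {
            std::fprintf(stderr, "usage: add_bom <file>\n");
            return 1;
        }
        const std::string path = argv[1];

        // Read the whole file as raw bytes.
        std::ifstream in(path, std::ios::binary);
        if (!in) {
            std::fprintf(stderr, "cannot open %s\n", path.c_str());
            return 1;
        }
        std::vector<char> bytes((std::istreambuf_iterator<char>(in)),
                                std::istreambuf_iterator<char>());
        in.close();

        // Skip files that already start with the UTF-8 BOM.
        const char bom[] = { '\xEF', '\xBB', '\xBF' };
        if (bytes.size() >= 3 && std::equal(bom, bom + 3, bytes.begin())) {
            return 0;
        }

        // Rewrite the file with the BOM prepended.
        std::ofstream out(path, std::ios::binary | std::ios::trunc);
        out.write(bom, sizeof bom);
        out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
        return 0;
    }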
Ulrich Eckhardt
  • 16,572
  • 3
  • 28
  • 55
  • There would be no need to guess if the compiler had a command-line switch like gcc has. The #pragma doesn't work; only the resource compiler understands it. And adding a BOM works for your own code, but is extremely annoying for that library you want to use that was originally developed for a different compiler. – Sebastian Redl Dec 18 '14 at 12:14
  • Because the project is not mine (meaning the moment I add a BOM I create a fork), because it is fairly standard practice to use UTF-8 without a BOM for cross-platform projects, and because there may be hundreds of those files. Besides, the text editor identifies the encoding correctly despite the missing BOM; the compiler could do the same thing. Anyway, at this point I would prefer a global switch of sorts. – SigTerm Dec 18 '14 at 12:32
0

To date, I have not encountered any solution to the problem.

If a fix for this behavior exists, it is apparently a well-guarded secret.

SigTerm
  • 26,089
  • 6
  • 66
  • 115
0

If your system locale is not English (e.g. Chinese or another language), a simple way to fix this is to change your 'Region and Language' system setting to English. Just follow the steps below:

Control Panel -> Clock, Language, and Region -> Region and Language ->
Administrative -> Language for non-Unicode programs -> Change system locale.

It's that simple. It fixed my problem, as my system locale is Chinese. The description of 'Language for non-Unicode programs' is clear:

This setting (system locale) controls the language used when displaying 
text in programs that do not support Unicode.

More details in the image

I ran into this problem when I tried to build my project on Windows, while the build succeeds on another Windows machine. I was frantically modifying all the non-ASCII characters (all of them are in comments) just so the compiler could move on, but there are too many files with this problem.

claymore
  • 69
  • 1
  • 5