Is it possible to get GCC to compile UTF-8 with BOM source files?

Question

I develop C++ cross platform using Microsoft Visual Studio on Windows and GCC on Ubuntu Linux.

In Visual Studio, I can use Unicode symbols like "π" and "²" in my code. Visual Studio always saves the source files as UTF-8 with BOM (Byte Order Mark).

For example:

// A = π.r²
double π = 3.14;

GCC happily compiles these files only if I remove the BOM first. If I do not remove the BOM, I get errors like these:

wwga_hydutils.cpp:28:9: error: stray ‘\317’ in program

wwga_hydutils.cpp:28:9: error: stray ‘\200’ in program

Which brings me to the question:

Is there a way to get GCC to compile UTF-8 files without first removing the BOM?

I'm using:

Windows 7
Visual Studio 2010

and:

Ubuntu 11.10 (Oneiric Ocelot)
GCC 4.6.1, 2011-06-27 (as provided by apt-get install gcc)

As the first commenter pointed out, my problem was not the BOM, but having non-ASCII characters outside of string constants. GCC does not like non-ASCII characters in symbol names, but it turns out GCC is fully compatible with UTF-8 with BOM.
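The error output itself supports this reading. The stray bytes GCC reports, \317 and \200, are the octal form of 0xCF 0x80, which is the UTF-8 encoding of π (U+03C0). A UTF-8 BOM would have shown up as \357\273\277 instead. The sketch below is only illustrative; the helper name `octal_escapes` is made up for this example and is not part of GCC.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render each byte of `bytes` as an octal escape,
// in the same style GCC uses in its "stray" diagnostics.
std::string octal_escapes(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        char buf[8];
        std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned>(b));
        out += buf;
    }
    return out;
}
```

Calling `octal_escapes("\xCF\x80")` (the bytes of π) gives `\317\200`, matching the two errors above, while `octal_escapes("\xEF\xBB\xBF")` (the BOM) gives `\357\273\277`.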

Works fine for me in GCC 4.4.5, using a string containing both of the Unicode characters in your question, in a file with a BOM. Also, the error you get has nothing to do with the BOM; it seems that the Unicode characters in question are outside any string (that's why they are called _stray_). — Some programmer dude, Oct 26 '11 at 08:25
@JoachimPileborg Yes, the Unicode characters are outside of strings: I was using "π" as a symbol name, and the "²" was just in comments. When I remove the BOM, it does eliminate the error from the console output, but I guess that's no guarantee that GCC is really handling the characters how I expect. — Boinst, Oct 26 '11 at 15:35
@JoachimPileborg, I've updated the question to include the context in which I'm using the unicode characters. — Boinst, Oct 26 '11 at 15:38
It is an error to have a BOM in a UTF-8 stream, because it precludes catting three of them together and getting the correct result. — tchrist, Oct 31 '11 at 01:27
Clang supports these symbols in identifiers; GCC only supports them in strings. To use Λ (Greek capital lambda) in identifiers in GCC, use a universal character name (https://www.ibm.com/support/knowledgecenter/en/ssw_ibm_i_74/rzarg/unicode_standard.htm), so a function `funΛ()` would be written as `fun\u039B()` to compile with GCC. I changed my compiler to Clang, and things worked fine. GCC's `-finput-charset=UTF-8 -fextended-identifiers` don't help either: `-fextended-identifiers` only enables the universal character name format, and if it is turned off (`-fno-extended-identifiers`), even `fun\u039B()` fails. — Sahil Singh, Jul 16 '19 at 06:10

score 4 · Answer 1 · edited May 08 '23 at 23:50

While Unicode identifiers are supported in GCC, UTF-8 input is not (at least in GCC versions before 10). Therefore, Unicode identifiers have to be encoded using \uXXXX and \UXXXXXXXX escape codes. However, a simple one-line patch to the C++ preprocessor allows GCC and g++ to process UTF-8 input, provided a recent version of iconv that supports C99 conversions is also installed. Details are given in UTF-8 Identifiers in GCC.

The patch is simple enough to give right here:

diff -cNr gcc-5.2.0/libcpp/charset.c gcc-5.2.0-ejo/libcpp/charset.c

Output:

*** gcc-5.2.0/libcpp/charset.c  Mon Jan  5 04:33:28 2015
--- gcc-5.2.0-ejo/libcpp/charset.c  Wed Aug 12 14:34:23 2015
***************
*** 1711,1717 ****
    struct _cpp_strbuf to;
    unsigned char *buffer;

!   input_cset = init_iconv_desc (pfile, SOURCE_CHARSET, input_charset);
    if (input_cset.func == convert_no_conversion)
      {
        to.text = input;
--- 1711,1717 ----
    struct _cpp_strbuf to;
    unsigned char *buffer;

!   input_cset = init_iconv_desc (pfile, "C99", input_charset);
    if (input_cset.func == convert_no_conversion)
      {
        to.text = input;

Even with the patch, two command-line options (-finput-charset and -fextended-identifiers) are needed to enable UTF-8 input. For example, try something like:

/usr/local/gcc-5.2/bin/gcc \
    -finput-charset=UTF-8 -fextended-identifiers \
    -o circle circle.c
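
For comparison, here is a minimal sketch of the escape-code form that works without the patch, applied to the question's π example. The function name `circle_area` is hypothetical. Depending on the GCC version, `-fextended-identifiers` may need to be passed explicitly for this to compile.

```cpp
// The identifier below is spelled with the universal character name
// for U+03C0 (Greek small letter pi), so the source stays pure ASCII.
double \u03C0 = 3.14;

// Hypothetical example function: A = pi * r^2
double circle_area(double r) {
    return \u03C0 * r * r;
}
```

The compiler treats `\u03C0` as the same identifier everywhere it appears, so the code behaves as if the variable had been written as π directly.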

score 4 · Accepted Answer · edited May 23 '17 at 12:16

According to the GCC Wiki, this isn't supported yet. You can use -fextended-identifiers and pre-process your code to convert the identifiers to UCN. From the linked page:

perl -pe 'BEGIN { binmode STDIN, ":utf8"; } s/(.)/ord($1) < 128 ? $1 : sprintf("\\U%08x", ord($1))/ge;'

See also g++ unicode variable name and Unicode Identifiers and Source Code in C++11?
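
For readers who would rather keep the conversion step in C++, the following is a rough sketch of the same idea as the Perl one-liner: every non-ASCII code point in UTF-8 input is replaced with a \UXXXXXXXX escape. The function name `utf8_to_ucn` is an assumption for illustration, and the sketch assumes well-formed UTF-8 (malformed bytes are simply copied through). Like the Perl version, it does not distinguish identifiers from comments or string literals.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: replace every non-ASCII code point in UTF-8 `src`
// with a \UXXXXXXXX universal character name; ASCII passes through unchanged.
std::string utf8_to_ucn(const std::string& src) {
    std::string out;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char lead = static_cast<unsigned char>(src[i]);
        std::size_t len = 1;       // length of the UTF-8 sequence in bytes
        std::uint32_t cp = lead;   // code point being decoded
        if (lead >= 0xF0 && lead <= 0xF7)     { len = 4; cp = lead & 0x07; }
        else if (lead >= 0xE0 && lead < 0xF0) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xC0 && lead < 0xE0) { len = 2; cp = lead & 0x1F; }

        if (len == 1 || i + len > src.size()) {
            // ASCII, a stray continuation byte, or a truncated sequence.
            out += src[i++];
            continue;
        }
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(src[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) {
            // Malformed sequence: copy the lead byte as-is and move on.
            out += src[i++];
            continue;
        }
        char buf[16];
        std::snprintf(buf, sizeof buf, "\\U%08x", static_cast<unsigned>(cp));
        out += buf;
        i += len;
    }
    return out;
}
```

Given the line `double π = 3.14;` as UTF-8 bytes, this returns `double \U000003c0 = 3.14;`, which is the form GCC accepts with -fextended-identifiers.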

answered Oct 26 '11 at 15:44

Adrian Cox

[GCC caught up in version 10](https://stackoverflow.com/questions/12692067/and-other-unicode-characters-in-identifiers-not-allowed-by-g/42158646#42158646) (mid 2020). – Peter Mortensen May 08 '23 at 21:07
