63

I've got a (generated) literal string in C++ that may contain characters that need to be escaped using the \x notation. For example:

char foo[] = "\xABEcho";

However, g++ (version 4.1.2 if it matters) throws an error:

test.cpp:1: error: hex escape sequence out of range

The compiler appears to be treating the Ec characters as part of the preceding hex escape (because they look like hex digits). Since a four-digit hex number won't fit in a char, an error is raised. Obviously for a wide string literal L"\xABEcho" the first character would be U+ABEC, followed by L"ho".

It seems this has changed sometime in the past couple of decades and I never noticed. I'm almost certain that old C compilers would only consider two hex digits after \x, and not look any further.

I can think of one workaround for this:

char foo[] = "\xAB""Echo";

but that's a bit ugly. So I have three questions:

  • When did this change?

  • Why doesn't the compiler accept >2-digit hex escapes only in wide string literals?

  • Is there a workaround that's less awkward than the above?
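
For reference, here's a minimal check of the two-literal workaround (my own sketch; it assumes an 8-bit char and ASCII):

#include <cassert>
#include <cstring>

int main() {
    // Adjacent string literals are concatenated *after* escape
    // processing, so the \xAB escape cannot swallow the 'E'.
    char foo[] = "\xAB" "Echo";
    assert(std::strlen(foo) == 5);
    assert(static_cast<unsigned char>(foo[0]) == 0xAB);
    assert(foo[1] == 'E');
    return 0;
}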

Greg Hewgill
  • Just a guess, but I can see having at least four hex digits as being useful for wide character types. – Marvo Apr 26 '11 at 01:25
  • @jww, your workaround was already included in the question and was considered ugly by the author. – maxschlepzig Jun 13 '15 at 05:56
  • The C++ Reference article on [Escape Sequences](http://en.cppreference.com/w/cpp/language/escape) summarizes the rules for the different styles (hex, octal, etc.) quite well. – maxschlepzig Jun 13 '15 at 06:25

6 Answers

30

GCC is only following the standard, which says: "Each [...] hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence."
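
A quick illustration of the greedy rule (my own example, not from the standard text): hex escapes consume every following hex digit, while octal escapes stop after at most three digits.

#include <cstring>
#include <iostream>

int main() {
    // char bad[] = "\xABE";  // error: hex escape sequence out of range
    char oct[] = "\1014";     // octal stops after three digits:
                              // '\101' ('A' in ASCII) followed by '4'
    std::cout << std::strlen(oct) << '\n';  // prints 2
    return 0;
}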

Ignacio Vazquez-Abrams
  • -1, note that this answer is not correct. Only a hexadecimal escape sequence is the longest sequence of hexadecimal digits; octal escape sequences, on the other hand, are limited to at most three octal digits. That's how the standard dictates it. (C++11, §2.14.3 Character literals). – Wiz Jun 16 '13 at 17:05
  • @Wiz: You *do* know that 4.1.2 had [no support for C++11](http://gcc.gnu.org/projects/cxx0x.html), right? – Ignacio Vazquez-Abrams Jun 16 '13 at 20:09
  • @IgnacioVazquez-Abrams I am not sure what you mean by that. This rule has been in existence since the original C standard back in C89. It's the same in C89/C99/C11/C++98/C++11. I only happen to quote the latest standard, that's all. – Wiz Jun 16 '13 at 20:46
  • @Wiz: Does that then mean that the standard is contradicting itself? – Ignacio Vazquez-Abrams Jun 16 '13 at 20:47
  • I am not sure what you mean by that. The standard just says that an octal escape sequence can be at most three octal digits, while hex escape sequences have no upper limit on their length. – Wiz Jun 16 '13 at 20:49
  • @Wiz: Oh, I see. You're fixated on the "octal" bit when the question doesn't even bring it up. I get it now. – Ignacio Vazquez-Abrams Jun 16 '13 at 20:52
  • The quoted text in your answer is simply wrong with respect to octal escape sequences. That's all I have an issue with. – Wiz Jun 16 '13 at 20:52
23

I have found answers to my questions:

  • C++ has always been this way (I checked Stroustrup 3rd edition; I didn't have an earlier one). K&R 1st edition did not mention \x at all (the only character escapes available at that time were octal). K&R 2nd edition states:

    '\xhh'
    

    where hh is one or more hexadecimal digits (0...9, a...f, A...F).

    so it appears this behaviour has been around since ANSI C.

  • While it might be possible for the compiler to accept hex escapes longer than two digits only in wide string literals, this would unnecessarily complicate the grammar.

  • There is indeed a less awkward workaround:

    char foo[] = "\u00ABEcho";
    

    The \u escape always takes exactly four hex digits, so it cannot run on into the following characters; a quick check follows below.
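
A quick check of that workaround (my own sketch; note that the bytes you get depend on the execution character set, e.g. GCC's -fexec-charset):

#include <cstring>
#include <iostream>

int main() {
    char foo[] = "\u00ABEcho";  // parses as U+00AB followed by "Echo"
    // With -fexec-charset=LATIN1 this is 5 bytes (0xAB then "Echo");
    // with a UTF-8 execution charset it is 6 (U+00AB encodes as 0xC2 0xAB).
    std::cout << std::strlen(foo) << '\n';
    return 0;
}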

Update: The use of \u isn't quite applicable in all situations, because most ASCII characters may not be specified using \u: the standard disallows universal character names that designate members of the basic source character set (or, with a few exceptions, any character below U+00A0). Here's a snippet from GCC:

/* The standard permits $, @ and ` to be specified as UCNs.  We use
     hex escapes so that this also works with EBCDIC hosts.  */
  else if ((result < 0xa0
            && (result != 0x24 && result != 0x40 && result != 0x60))
           || (result & 0x80000000)
           || (result >= 0xD800 && result <= 0xDFFF))
    {
      cpp_error (pfile, CPP_DL_ERROR,
                 "%.*s is not a valid universal character",
                 (int) (str - base), base);
      result = 1;
    }
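
For instance, trying to spell a plain ASCII letter with \u is rejected (a hypothetical one-liner showing the kind of diagnostic the snippet above produces):

char bad[] = "\u0041bc";  // error: \u0041 is not a valid universal character
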
Greg Hewgill
  • Also `\u` is not really equivalent to `\x` in the sense that `\x` produces a particular integer value, whereas `\u` produces a certain ISO 10646 code point, so the numerical value depends on encoding. – Brian Bi Jul 25 '14 at 20:28
  • On some systems, a `char` may require three or four hex digits (or even more). While `CHAR_BIT` is usually eight, there are some systems still in production (such as digital signal processors) where `char` is some other size (16 probably being the most common size other than eight). – supercat Jun 10 '15 at 15:21
  • It's interesting that the number of hexadecimal digits in an escape is unbounded but the number of octal digits must be one, two, or three. And why the heck do the longer universal character names require eight digits when the first two must necessarily be 0? – Adrian McCarthy Feb 17 '18 at 23:01
5

I solved this by specifying the following character with \xnn too. Unfortunately, you have to keep doing this for as long as the following characters are hex digits ([0-9a-fA-F]). For example, "\xnneceg" is replaced by "\xnn\x65\x63\x65g".
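
Since the question mentions that the literal is generated, this rule is easy to automate. Here is a sketch of a generator-side escaper (my own code, not the answer's): it hex-escapes every byte that needs escaping, and also any following byte that is itself a hex digit, so an escape can never run on.

#include <cctype>
#include <cstdio>
#include <string>

std::string escape_for_cpp(const std::string& raw) {
    std::string out;
    bool prev_was_hex_escape = false;
    for (unsigned char c : raw) {
        bool plain = std::isprint(c) && c != '"' && c != '\\';
        if (plain && !(prev_was_hex_escape && std::isxdigit(c))) {
            out += static_cast<char>(c);  // safe to emit verbatim
            prev_was_hex_escape = false;
        } else {
            char buf[5];
            std::snprintf(buf, sizeof buf, "\\x%02X", c);
            out += buf;
            prev_was_hex_escape = true;
        }
    }
    return out;
}

For example, the bytes {0xAB, 'E', 'c', 'h', 'o'} come out as "\xAB\x45\x63ho".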

mike b.
  • there are better ways like [`\u00nnEcho`](https://stackoverflow.com/a/10220539/995714) or [`"\xnn" "Echo"`](https://stackoverflow.com/q/31239524/995714) – phuclv Aug 01 '18 at 15:12
5

I'm pretty sure that C++ has always been this way. In any case, CHAR_BIT may be greater than 8, in which case '\xABE' or '\xABEc' could be valid.
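
If you rely on two-digit \x escapes covering every possible char value, a guard like this (my own sketch, using C++11's static_assert) documents that assumption:

#include <climits>

static_assert(CHAR_BIT == 8, "two-digit \\x escapes assume an 8-bit char");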

Ben Voigt
-1

These are wide-character literals.

char foo[] = "\x00ABEcho";

Might be better.

Here's some information; it's not about gcc, but it still seems to apply.

http://publib.boulder.ibm.com/infocenter/iadthelp/v7r0/index.jsp?topic=/com.ibm.etools.iseries.pgmgd.doc/cpprog624.htm

This link includes the important line:

Specifying \xnn in a wchar_t string literal is equivalent to specifying \x00nn

This may also be helpful.

http://www.gnu.org/s/hello/manual/libc/Extended-Char-Intro.html#Extended-Char-Intro

S.Lott
  • Doesn't change the behavior at all; the standard says "There is no limit to the number of digits in a hexadecimal sequence." So now "\x00ABEc" is treated as a single hexadecimal escape sequence. – Ben Voigt Apr 26 '11 at 01:34
  • @Ben Voigt: "Specifying `\xnn` in a wchar_t string literal is equivalent to specifying `\x00nn`". It seems that some compilers are at odds with your interpretation. – S.Lott Apr 26 '11 at 01:37
  • But what does it say about `\xnnn`? Is that considered equivalent to `\x00nnn`? – Ignacio Vazquez-Abrams Apr 26 '11 at 01:49
  • @Ignacio Vazquez-Abrams: Nothing. – S.Lott Apr 26 '11 at 02:10
-2

I also ran into this problem. I found that I could add a space after the second hex digit to terminate the escape, and then get rid of the space by following it with a backspace '\b'. Not exactly desirable, but it seemed to work.

"Julius C\xE6sar the conqueror of the frana\xE7 \bais"

G.D.M.