What characters are legal to use in string literals?

Question

I am wondering if it is legal in C to put ASCII characters like TAB, BEL and ESC directly in a string literal.

There is no way to display the characters in plain text here on Stackoverflow so I had to take a screenshot instead.

[screenshot: example]

Characters that do not have a graphical representation are displayed using caret notation and highlighted in purple in the screenshot. There is also a TAB character on line 7 that indents the text.

This compiles without any warnings using gcc -std=c99 -pedantic, but is it really fully portable?

This is not something that I would use in any serious program. I am just curious whether the standards allow it.

Please don't post images of code. And e.g. `^I` for TAB is not a format recognized by C, the `^` character has no special meaning there (unlike in the tty world, where it kind of means ctrl). — unwind, Feb 12 '15 at 19:27
Take a look: http://stackoverflow.com/questions/6796157/unicode-encoding-for-string-literals-in-c11 — Valeri Atamaniouk, Feb 12 '15 at 19:36
To be clear: the source file being displayed here literally contains tab, newline, and page break characters (the latter rendered as a magenta `^L`). The literal `^I` after the word "using" isn't part of what elias is asking about. — , Feb 12 '15 at 20:01
@duskwuff is correct. The characters are displayed using "Caret notation". I had to use a screenshot because there is no way to display the characters in plain text here on Stackoverflow. I am sorry for the confusion and I have updated the question to make it clearer. — wefwefa3, Feb 12 '15 at 20:24

n. m. could be an AI · Accepted Answer · 2015-02-13T05:21:37.820

3

The portable characters that can appear in the program source are exactly these:

the 26 uppercase letters of the Latin alphabet

A  B  C  D  E  F  G  H  I  J  K  L  M
N  O  P  Q  R  S  T  U  V  W  X  Y  Z

the 26 lowercase letters of the Latin alphabet

a  b  c  d  e  f  g  h  i  j  k  l  m
n  o  p  q  r  s  t  u  v  w  x  y  z

the 10 decimal digits

0  1  2  3  4  5  6  7  8  9

the following 29 graphic characters

!  "  #  %  &  '  (  )  *  +  ,  -  .  /  :
;  <  =  >  ?  [  \  ]  ^  _  {  |  }  ~

the space character, and control characters representing horizontal tab, vertical tab, and form feed.

Source: the C standard, any version.

An implementation must accept these characters, and is allowed to accept any additional characters.

edited Feb 13 '15 at 05:21

answered Feb 12 '15 at 20:34

C11 (from 2011) supports UTF-8 strings and that is not an optional feature; a compiler that doesn't support them is not C11 conforming. – Mecki Mar 18 '22 at 09:52
@Mecki A C11 implementation *may*, but does not *have to*, support *source characters* other than those in the basic character set. UTF8 strings can be written with just the basic source characters using universal character names (\u and \U). – n. m. could be an AI Mar 18 '22 at 10:06

score 1 · Answer 2 · answered Feb 12 '15 at 20:46

If a backslash immediately precedes a literal newline character (not \n), both the backslash and the newline are removed. Lines can be split like that anywhere except in the middle of a trigraph: trigraphs are replaced before lines are spliced, so if a trigraph is split by a backslash-newline sequence, that sequence is removed but the trigraph is not replaced.

A literal tab character is allowed in a string literal (in portable code) and has the same semantics as \t. C11 (n1570) 6.4.5 p1 states that "any member of the source character set except the double-quote ", backslash \, or new-line character" can be part of a string literal, and the tab character is part of the source character set (ibid. 5.2.1 p3).

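The escape character (ASCII 0x1b; \e is a compiler extension, not standard C) isn't part of the basic source character set and may not even exist at all (on a non-ASCII system). The same holds for BEL, though \a is part of the C standard. These characters cannot be used portably as literal characters in the source. Form feed, like tab, is part of the basic source character set (5.2.1 p3), so it can be.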

An implementation is free to accept any additional characters it pleases (beyond the minimal requirements of the standard). The mapping from the source character set to the execution character set is implementation-defined (an implementation may map different source characters to the same execution character).

score 0 · Answer 3 · answered Feb 12 '15 at 19:33

A null-terminated string is simply a sequence of 8-bit values, which range over 0 to 255 or -128 to 127 depending on their signedness.

When you send your bytes to something like a terminal it is all up to the terminal what to do with the bytes. Some bytes like 'a'-'z' might be standard, but only if you assume 8-bit character encoding. Other bytes like '€' might only be possible to present correctly with the right character set.

Finally we have those terminal control bytes to control the cursor and ring the bell. It is all up to the terminal to handle those bytes, but writing them will still be valid C code.

Henrik Carlqvist

A null-*terminated* string may not *contain* the value `0`. – Jongware Feb 12 '15 at 19:52
And this doesn't really address the question of whether it's legal to have these characters in C source code, anyways. – Feb 12 '15 at 20:01
Also, a string may be a sequence of higher-bit values. `CHAR_BIT` is *at least* 8, but can be larger (in some historical systems, it is `9`; in many DSPs, it is `32`, though granted, the latter don't typically deal with strings at all). For wide strings, it is `sizeof(wchar_t) * CHAR_BIT`. – Tim Čas Feb 12 '15 at 20:40
@Jongware Detail: All _strings_ contain exactly 1 null character: `'\0'`. "A _string_ is a contiguous sequence of characters terminated by and including the first null character." C11 dr §7.1.1 1 – chux - Reinstate Monica Feb 13 '15 at 03:56
@Jongware I would say that a null-terminated string _must_ contain the value 0 at the end. – Henrik Carlqvist Feb 13 '15 at 20:38
I guess it's a matter of interpretation then. A null-terminated string contains a single zero, which both ends the string and kind of self-fulfills the description of itself as being "null-terminated" -- i.e. you cannot have a properly valid string which does not contain exactly *one* single zero. – Jongware Feb 13 '15 at 21:23
