4

I have a zero terminated string:

char* s = ...;

I am generating C source code at runtime, and I want to output a string literal for s that, when compiled as part of the generated C program, produces a string identical to s.

The algorithm I am using is:

Output "

Foreach char c in s
    if c == " output \"
    else if c == \ output \\
    else output c

Output "

Besides " and \, are there any other characters that need special treatment?

Andrew Tomazos
  • There are plenty others. Unicode, single quotes, new lines, etc. all require special handling – Richard J. Ross III Aug 31 '12 at 04:14
  • What if `s` contains an escaped \"? – dda Aug 31 '12 at 04:14
  • @dda: Then it will correctly be encoded as `"...\\\"..."` – Andrew Tomazos Aug 31 '12 at 04:24
  • @RichardJ.RossIII: Won't unicode and single quotes be preserved? Do they need escaping? – Andrew Tomazos Aug 31 '12 at 04:30
  • @Andrew: I think the problem Richard refers to is that the source character set of a C implementation is not necessarily the same as the execution character set. Which is a fancy way of saying that just because `char` can contain certain characters (for example Latin 1 or UTF-8) doesn't necessarily mean that source files can. Also the more obvious point you didn't question, that newlines need special treatment. ASCII `0x27` single-quote doesn't need special treatment, but the "curved" quote at `0x92` in Windows CP-1252 and `\u2019` might. – Steve Jessop Aug 31 '12 at 08:15
  • @SteveJessop: http://stackoverflow.com/questions/12216946/gcc-4-7-source-character-encoding-and-execution-character-encoding-for-string-li – Andrew Tomazos Aug 31 '12 at 14:01
  • And aside from whether a particular implementation's charsets can be different, you don't actually *say* in the question that the output code is going to be compiled using the same implementation that the first program ran on :-) – Steve Jessop Aug 31 '12 at 14:13
  • @SteveJessop: In my particular case both the build and target compiler are gcc 4.7/linux/x86_64. As it happens the input and output data in my project are in UTF-8 (as all 8-bit character data should be in these days) - it turns out this is also the default src and exec encoding of gcc - so everything fits together nicely without transcoding or escaping. But worth looking into anyway, thanks. – Andrew Tomazos Aug 31 '12 at 22:47

2 Answers

9
  • You must encode ", \, \r, \n and \0 (and ?, as Michael Burr mentions). Failure to do this will break your code.
  • You should encode non-ASCII characters using a hexadecimal escape code, e.g. \x80. Whether non-ASCII characters are accepted in source code is implementation-defined. Leaving these characters unencoded will work on some compilers but could break on others.
  • You can encode ASCII non-printable characters. It would improve the readability of the generated source code if you used the escape codes for characters like \t, \b, \x05, etc. If you don't do this your code will still work but it might be hard to read.
  • You don't need to escape ' inside a double-quoted string. It's legal, but it's unnecessary and it doesn't make the source code more readable.
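
The rules above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: `escape_c_literal` is a hypothetical helper name, the host character set is assumed to be ASCII-compatible, and the input is a zero-terminated string (so \0 never appears inside it). It uses three-digit octal escapes rather than \x escapes, because a hexadecimal escape has no length limit: `"\x05a"` is the single character 0x5A, not 0x05 followed by `a`, whereas `"\005a"` is unambiguous.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the thread): writes s into dst as a C string
   literal, including the surrounding double quotes.
   Returns 0 on success, or -1 if dst (of size cap) is too small.
   Assumes an ASCII-compatible character set. */
int escape_c_literal(const char *s, char *dst, size_t cap)
{
    size_t n = 0;

    if (cap < 3)                /* need room for "" and the terminator */
        return -1;
    dst[n++] = '"';

    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; ++p) {
        char buf[5];            /* longest piece is \ooo plus terminator */
        const char *piece;

        switch (*p) {
        case '"':  piece = "\\\""; break;   /* must escape */
        case '\\': piece = "\\\\"; break;   /* must escape */
        case '\n': piece = "\\n";  break;   /* must escape */
        case '\r': piece = "\\r";  break;   /* must escape */
        case '?':  piece = "\\?";  break;   /* avoids trigraphs */
        case '\t': piece = "\\t";  break;   /* readability only */
        default:
            if (*p >= 0x20 && *p <= 0x7E) {
                /* printable ASCII, including ': copy unchanged */
                buf[0] = (char)*p;
                buf[1] = '\0';
            } else {
                /* control or non-ASCII byte: fixed-width octal escape */
                snprintf(buf, sizeof buf, "\\%03o", (unsigned)*p);
            }
            piece = buf;
            break;
        }

        size_t len = strlen(piece);
        if (n + len + 2 > cap)  /* keep room for closing quote and NUL */
            return -1;
        memcpy(dst + n, piece, len);
        n += len;
    }

    dst[n++] = '"';
    dst[n] = '\0';
    return 0;
}
```

Each input byte expands to at most four output characters, so a buffer of `4 * strlen(s) + 3` bytes is always large enough.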
Mark Byers
  • If there are bytes in a C string literal from 0x80 to 0xFF, then aren't they preserved as-is? – Andrew Tomazos Aug 31 '12 at 04:28
  • @AndrewTomazos-Fathomling: I believe that this is implementation dependent. It will probably work, but it's not wise to rely on it. – Mark Byers Aug 31 '12 at 04:41
  • I believe that an implementation is permitted to interpret source files as (for example) UTF-8. If you wrote an ISO-Latin or CP1252 string "as is" into the source file, then you lose, unless after writing the source file you then transcode it. I believe an implementation is also permitted to interpret source files as pure ASCII, and reject the file if it contains any bytes not in the list of source characters required by the standard. Then you just lose full stop, unless you used escape codes. – Steve Jessop Aug 31 '12 at 08:23
4

The set of simple escape sequences in standard C includes the following:

\'
\"
\?
\\
\a  (alert - usually Ctrl-G)
\b  (backspace)
\f  (form feed)
\n  (newline)
\r  (carriage return)
\t  (horizontal tab)
\v  (vertical tab)

Note that \? is in there so that the question mark can be escaped: a sequence like "??!" can be encoded as `"\?\?!"` to prevent it from being interpreted as a dreaded trigraph.
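
As a small illustration of the trigraph point: when trigraph replacement is enabled, the three source characters question mark, question mark, exclamation mark inside a literal are replaced by a single `|` before the literal is even tokenized. Escaping the question marks keeps all three characters. The function name below is made up for the example.

```c
/* Hypothetical demo function: returns a three-character string.
   Written with \? so that no trigraph can form in the source text,
   whether or not the compiler has trigraph replacement enabled. */
const char *question_question_bang(void)
{
    return "\?\?!";   /* the characters '?', '?', '!' */
}
```

Splitting the literal, as in `"?" "?!"`, should have the same effect, since trigraphs are replaced in the raw source text before adjacent literals are concatenated.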

For completeness, I would consider handling each of these (though some of them like \a and \v I might escape using a \x escape sequence instead - that may depend on your needs). Also, for any other non-printable character, I'd convert to its hex equivalent using the \x escape sequence.
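
One way to handle each of these is a small lookup that returns the named escape for a character, leaving the caller to fall back to a numeric escape for any other non-printable character. This is a sketch only: `simple_escape` is a hypothetical name, and the single quote is deliberately left out because it needs no escaping inside a double-quoted literal.

```c
#include <stddef.h>

/* Hypothetical lookup (not from the thread): returns the source-text
   spelling of the simple escape sequence for c, or NULL if c has none
   that is needed inside a double-quoted string literal. */
const char *simple_escape(char c)
{
    switch (c) {
    case '"':  return "\\\"";
    case '?':  return "\\?";
    case '\\': return "\\\\";
    case '\a': return "\\a";
    case '\b': return "\\b";
    case '\f': return "\\f";
    case '\n': return "\\n";
    case '\r': return "\\r";
    case '\t': return "\\t";
    case '\v': return "\\v";
    default:   return NULL;   /* caller copies c or emits a numeric escape */
    }
}
```

To emit \a or \v numerically instead, as suggested above, drop those two cases so that they fall through to the default.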

Michael Burr