I am learning about compiler design. The task of the lexical analyser in a compiler is to convert the code into a stream of tokens. But I am confused about why we consider a string a single token. For example: `printf("%d is integer", x);`
In this statement, `printf`, `(`, `"%d is integer"`, `,`, `x`, `)`, and `;` are the tokens, but why is `%d` in the string not considered a separate token?

-
Because `%d` has no meaning to the compiler itself. It is interpreted by the `printf` function, not by the compiler. The compiler has no use for the different parts of the string, so it takes the string as a whole. – Eugene Sh. Jun 08 '21 at 14:51
-
Check the generated assembly: https://godbolt.org/z/Pns6hss7v You'll see that `%d` is not separate from the string literal there either because it doesn't need to be. To the compiler, `%d` means nothing. `printf` is the thing that sees it as a format specifier. – mediocrevegetable1 Jun 08 '21 at 15:03
-
Note that some modern compilers do understand enough about format strings for `printf()` and `scanf()` and their relatives to be able to diagnose some misuses, but the basic language does not require that. – Jonathan Leffler Jun 08 '21 at 15:28
-
Worth mentioning, I think: a typical C tokenizer reading `printf("%" "d" " " "is" " integer"` would return one `printf` token, one `(` token, and then five separate string tokens. It's up to the next phase of compilation to notice the adjacent string tokens and combine them into a single string. If the compiler does have some sort of printf-format-scanner, it must operate on the resulting combined string, not the individual tokens. – torek Jun 13 '21 at 14:18
1 Answer
Because format specifiers like `%d` (or any other string contents) are not syntactically meaningful - there's no element of the language grammar that depends on them. String contents (including format specifiers like `%d`) are data, not code, and thus not meaningful to the compiler. The character sequence `%d` is only meaningful at runtime, and only to the `*printf`/`*scanf` families of functions, and only as part of a format string.
To recognize `%d` as a distinct token, you would have to tokenize the entire string: `"`, `%d`, `is`, `integer`, `"`. That opens up a whole can of worms on its own, making parsing of strings more difficult.
Some compilers do examine the format string arguments to `printf` and `scanf` calls to do some basic sanity checking, but that's well after tokenization has already taken place. At the tokenization stage, you don't know that this is a call to the `printf` library function. It's not until after syntax analysis that the compiler knows that this is a specific library call and can perform that kind of check.
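To make the point concrete, here is a minimal sketch of a toy tokenizer in C. It is not how any particular compiler is implemented; the names `TokKind`, `Token`, `next_token`, and `tokenize` are hypothetical, and it handles only identifiers, double-quoted strings, and single-character punctuation. The thing to notice is that the string branch scans from the opening quote to the closing quote without ever inspecting what lies between them, so `%d` never gets a chance to become a token.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for this sketch; a real C lexer has many more. */
typedef enum { TOK_IDENT, TOK_STRING, TOK_PUNCT, TOK_END } TokKind;

typedef struct {
    TokKind kind;
    const char *start; /* points into the source buffer, not a copy */
    size_t len;        /* number of characters in the token */
} Token;

/* Scan one token starting at *p and advance *p past it. */
Token next_token(const char **p)
{
    const char *s;
    Token t;

    assert(p != NULL && *p != NULL);
    s = *p;
    while (isspace((unsigned char)*s))
        s++; /* whitespace separates tokens but is not a token */

    t.start = s;
    if (*s == '\0') {
        t.kind = TOK_END;
    } else if (isalpha((unsigned char)*s) || *s == '_') {
        t.kind = TOK_IDENT;
        while (isalnum((unsigned char)*s) || *s == '_')
            s++;
    } else if (*s == '"') {
        /* String literal: consume up to the closing quote.
           The contents (including %d) are opaque data here. */
        t.kind = TOK_STRING;
        s++;
        while (*s != '\0' && *s != '"') {
            if (*s == '\\' && s[1] != '\0')
                s++; /* skip the character after a backslash */
            s++;
        }
        if (*s == '"')
            s++;
    } else {
        t.kind = TOK_PUNCT; /* assumption: all punctuation is one character */
        s++;
    }
    t.len = (size_t)(s - t.start);
    *p = s;
    return t;
}

/* Fill out[] with up to max tokens from src; returns how many were stored. */
size_t tokenize(const char *src, Token *out, size_t max)
{
    size_t n = 0;
    const char *p = src;

    while (n < max) {
        Token t = next_token(&p);
        if (t.kind == TOK_END)
            break;
        out[n++] = t;
    }
    return n;
}
```

Calling `tokenize` on the statement from the question, `printf("%d is integer", x);`, yields seven tokens, and the third one is a single `TOK_STRING` covering all 15 characters of `"%d is integer"`, quotes included. Any format checking would have to be a later pass that looks inside that token once the compiler knows the call is to `printf`.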
