strings with space between them

Question

I'm trying to write a lexical analyzer for C tokens by constructing a DFA for each kind of token and simulating the DFAs in C. Currently I'm trying to identify string literals. By definition, a string literal is a sequence of characters enclosed in double quotes ("). Consider the following program:

#include<stdio.h>
int main()
{
    char *a = "Hello "


    "World";
    printf("%s",a);
}

Output:

Hello World

So now I am confused: should I treat "Hello " and "World" as separate tokens, or "Hello World" combined as a single token? Thank you! :)

Any two string literals are concatenated at compile time if there is only whitespace between them. — Jason Hu, Oct 02 '15 at 17:25
@Michi, does the C compiler consider them multiple tokens or a single one? I assume the preprocessor strips all the whitespace before tokenizing? — Pruthvi Raj, Oct 02 '15 at 17:26
`"Hello"` and `"World"` are two separate *tokens*. That's a lexical analysis consideration. When they appear adjacent to one another, they represent two parts of a single string literal. That's a semantic consideration -- i.e. what that combination of tokens means in C source code. — John Bollinger, Oct 02 '15 at 17:26
@JohnBollinger, oh, so the lexical analyzer just sends the tokens down the phases of the compiler separately and the semantic analyzer concatenates them? — Pruthvi Raj, Oct 02 '15 at 17:29
@PruthviRaj, yes, that would be a conventional compiler architecture. — John Bollinger, Oct 02 '15 at 17:31
Yes, I was thinking of something else :). I thought he needed to know what happens with all the whitespace. — Michi, Oct 02 '15 at 17:31
@JohnBollinger, I see, thanks. Could you please post it as an answer so that I can accept it? :) — Pruthvi Raj, Oct 02 '15 at 17:32
@Michi, no, I know what happens with the whitespace; I just wanted to know how a classic C lexical analyzer would treat those string literals. — Pruthvi Raj, Oct 02 '15 at 17:33
You're asking about an implementation detail. The C specification requires that `"Hello" "World"` be concatenated into a single string literal. It doesn't tell you how to implement that requirement. So you can do whatever you want. — user3386109, Oct 02 '15 at 17:37
@user3386109, ah, I see. I've seen the source of a classic C implementation on GitHub, and it seems to treat them as separate tokens, so I guess that's how they were implemented? — Pruthvi Raj, Oct 02 '15 at 17:40
I'll take your word for it; I'm not familiar with the GitHub project that you refer to. It is certainly reasonable to treat them as separate tokens. — user3386109, Oct 02 '15 at 17:48

score 2 · Accepted Answer · answered Oct 02 '15 at 17:52

In comments I wrote

"Hello" and "World" are two separate tokens. That's a lexical analysis consideration. When they appear as consecutive tokens, they represent two parts of a single string literal. That's a semantic consideration -- i.e. what that combination of tokens means in C source code.

That describes a view of the question in terms of conventional, generic compiler construction. For example, the distinction is between what might be represented in a lex scanner definition and what would be handled in a yacc parser description (to put it in terms of the traditional tools).
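To make the scanner side of that distinction concrete, here is a minimal sketch of a hand-written DFA that recognizes one string-literal token at a time. The names (`scan_string_literal`, the `sl_state` enum) are my own inventions for illustration, and the sketch assumes line splicing has already been done, so a newline inside a literal is treated as an error. Run on the question's source text, it stops after `"Hello "`; `"World"` would be found by a second, separate call.

```c
#include <stddef.h>

/* States of a small DFA for one C string literal (hypothetical names). */
enum sl_state { SL_START, SL_BODY, SL_ESCAPE, SL_DONE, SL_ERROR };

/*
 * Returns the length of the string-literal token that starts at src[0],
 * counting both quotes, or 0 if no complete literal starts there.
 * Escape sequences are only skipped over, not validated.
 */
size_t scan_string_literal(const char *src)
{
    enum sl_state state = SL_START;
    size_t i = 0;

    while (state != SL_DONE && state != SL_ERROR) {
        char c = src[i];

        switch (state) {
        case SL_START:                  /* expect the opening quote */
            state = (c == '"') ? SL_BODY : SL_ERROR;
            break;
        case SL_BODY:                   /* inside the literal */
            if (c == '"')
                state = SL_DONE;
            else if (c == '\\')
                state = SL_ESCAPE;
            else if (c == '\0' || c == '\n')
                state = SL_ERROR;       /* unterminated literal */
            break;
        case SL_ESCAPE:                 /* character right after a backslash */
            state = (c == '\0' || c == '\n') ? SL_ERROR : SL_BODY;
            break;
        default:
            break;
        }
        if (state != SL_ERROR)
            i++;                        /* consume the character */
    }
    return (state == SL_DONE) ? i : 0;
}
```

A driver loop would call this at each position where a `"` is seen, emit one token per call, and skip the whitespace in between; nothing at this level needs to know that the two literals will later be joined.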

In practice, C defines a larger and more detailed set of "translation phases" for building an executable program from C sources (C99 5.1.1.2). In C's particular model of the process, "Hello " and "World" are separate preprocessing tokens, identified in translation phase 3. These are concatenated into a single token at translation phase 6. All (remaining) preprocessing tokens are converted to ordinary tokens at translation phase 7. The resulting tokens are then the input to the syntactic and semantic analysis (also part of phase 7).
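As a rough illustration of the joining step only, the sketch below reads a run of string literals separated by nothing but whitespace and writes their combined contents into a buffer. The function name `join_adjacent_literals` and its interface are hypothetical, and the sketch simplifies the standard's model: it works directly on source text rather than on a token stream, and it copies escape sequences through unchanged instead of converting them first.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: starting at src, reads a run of string literals
 * separated only by whitespace and writes their joined contents (quotes
 * removed, escape sequences copied through unchanged) into out.
 * Returns the number of literals joined, or 0 on an unterminated literal
 * or if out is too small; out's contents are unspecified on failure.
 */
size_t join_adjacent_literals(const char *src, char *out, size_t outsz)
{
    size_t n = 0;       /* characters written to out */
    size_t count = 0;   /* literals consumed so far */

    if (outsz == 0)
        return 0;

    for (;;) {
        while (isspace((unsigned char)*src))
            src++;                      /* whitespace between literals */
        if (*src != '"')
            break;                      /* the run of literals ends here */
        src++;                          /* skip the opening quote */

        while (*src != '"') {
            if (*src == '\0' || *src == '\n')
                return 0;               /* unterminated literal */
            if (*src == '\\' && src[1] != '\0' && src[1] != '\n') {
                if (n + 1 >= outsz)
                    return 0;
                out[n++] = *src++;      /* keep the backslash itself */
            }
            if (n + 1 >= outsz)
                return 0;
            out[n++] = *src++;          /* ordinary or escaped character */
        }
        src++;                          /* skip the closing quote */
        count++;
    }
    out[n] = '\0';
    return count;
}
```

Given the initializer text from the question, `"Hello "` followed by blank lines and `"World"`, this reports two literals and leaves `Hello World` in the buffer. Doing the join as a separate pass over already-scanned literals mirrors the phase 3 / phase 6 split, but as noted, an implementation is free to organize the work differently.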

C does not require implementations to actually perform translation (compilation) according to the given model, with all its separate phases, and many do not. C just requires that the end result be as if the implementation behaved according to the model. In that sense, your question can only be answered "it depends". As a conceptualization of the underlying question "what is a token?" that is not specific to C, however, I maintain that my original, short description provides a useful mental model.

Thank you. It'd be helpful if you could give me a link to those `translation phases` you are referring to so I can read up on them :) — Pruthvi Raj, Oct 02 '15 at 18:04
@PruthviRaj, I gave you a citation to the appropriate section of the (C99) standard. You can find information about where to get the referenced document here: http://stackoverflow.com/questions/81656/where-do-i-find-the-current-c-or-c-standard-documents — John Bollinger, Oct 02 '15 at 18:06
