I'm trying to write a lexical analyzer for C tokens by constructing a DFA for each kind of token and simulating them in C. Currently I'm trying to identify string literals. By definition, a string literal is a sequence of characters enclosed in double quotes ("). Consider the following program:

#include<stdio.h>
int main()
{
    char *a = "Hello "


    "World";
    printf("%s",a);
}

Output:

Hello World

So now I am confused: should I consider Hello and World as separate tokens, or Hello World combined as a single token? Thank you! :)

Pruthvi Raj
  • The compiler ignores it if there is more than one space – Michi Oct 02 '15 at 17:25
  • Any two string literals will be concatenated at compile time if there is only whitespace between them. – Jason Hu Oct 02 '15 at 17:25
  • @Michi, does the C compiler consider them as multiple tokens or a single one? I assume the preprocessor strips all the whitespace before tokenizing? – Pruthvi Raj Oct 02 '15 at 17:26
  • `"Hello"` and `"World"` are two separate *tokens*. That's a lexical analysis consideration. When they appear adjacent to one another, they represent two parts of a single string literal. That's a semantic consideration -- i.e. what that combination of tokens means in C source code. – John Bollinger Oct 02 '15 at 17:26
  • What does your language spec say? – Scott Hunter Oct 02 '15 at 17:27
  • @JohnBollinger, oh, so the lexical analyzer just sends the tokens down the phases of the compiler separately and the semantic analyzer concatenates them? – Pruthvi Raj Oct 02 '15 at 17:29
  • @PruthviRaj, yes, that would be a conventional compiler architecture. – John Bollinger Oct 02 '15 at 17:31
  • Yes, I was thinking of something else :). I thought he needed to know what happens with all the white space. – Michi Oct 02 '15 at 17:31
  • @JohnBollinger, I see, thanks. Could you please post it as an answer so that I can accept it? :) – Pruthvi Raj Oct 02 '15 at 17:32
  • @Michi, no, I know what happens with the whitespace; I just wanted to know how a classic C lexical analyzer would treat those string literals – Pruthvi Raj Oct 02 '15 at 17:33
  • You're asking about an implementation detail. The C specification requires that `"Hello" "World"` be concatenated into a single string literal. It doesn't tell you how to implement that requirement. So you can do whatever you want. – user3386109 Oct 02 '15 at 17:37
  • @user3386109, ah, I see. I've seen the source of a classic implementation of C on GitHub; it seems to consider them as separate tokens, so I guess that's how they were implemented? – Pruthvi Raj Oct 02 '15 at 17:40
  • I'll take your word for it, I'm not familiar with the github project that you refer to. It is certainly reasonable to treat them as separate tokens. – user3386109 Oct 02 '15 at 17:48

1 Answer

In comments I wrote

"Hello" and "World" are two separate tokens. That's a lexical analysis consideration. When they appear as consecutive tokens, they represent two parts of a single string literal. That's a semantic consideration -- i.e. what that combination of tokens means in C source code.

That describes a view of the question in terms of conventional, generic compiler construction. For example, the distinction is between what might be represented in a lex scanner definition and what would be handled in a yacc parser description (to put it in terms of the traditional tools).
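To make the lexical side of that distinction concrete, here is a minimal sketch of a hand-coded DFA that recognizes a single string-literal token. The names `sl_state` and `scan_string_literal` are invented for this illustration. The sketch assumes line splicing has already happened, ignores encoding prefixes such as `L"..."`, and does not validate escape sequences.

```c
#include <assert.h>
#include <stddef.h>

/* DFA states for one string literal (hypothetical names for this sketch). */
enum sl_state { SL_START, SL_BODY, SL_ESCAPE, SL_ACCEPT, SL_REJECT };

/*
 * Run the DFA from the start of src.
 * Returns the length of the string-literal token, including both quotes,
 * or 0 if src does not begin with a complete string literal.
 */
size_t scan_string_literal(const char *src)
{
    enum sl_state state = SL_START;
    size_t i = 0;

    assert(src != NULL);
    while (state != SL_ACCEPT && state != SL_REJECT) {
        char c = src[i];

        switch (state) {
        case SL_START:
            state = (c == '"') ? SL_BODY : SL_REJECT;
            break;
        case SL_BODY:
            if (c == '"')
                state = SL_ACCEPT;
            else if (c == '\\')
                state = SL_ESCAPE;
            else if (c == '\0' || c == '\n')
                state = SL_REJECT;      /* unterminated literal */
            /* any other character: stay in SL_BODY */
            break;
        case SL_ESCAPE:
            /* simplification: any escaped character is accepted */
            state = (c == '\0' || c == '\n') ? SL_REJECT : SL_BODY;
            break;
        default:
            break;
        }
        if (state != SL_REJECT)
            i++;                        /* consume the character */
    }
    return (state == SL_ACCEPT) ? i : 0;
}
```

Called at the opening quote of `"Hello "` in the question's program, this returns 8. The caller would then skip the white space and call it again at `"World"`, which yields a second, separate token; nothing in the DFA itself joins the two.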

In practice, C defines a larger and more detailed set of "translation phases" for building an executable program from C sources (C99 5.1.1.2). In C's particular model of the process, "Hello " and "World" are separate preprocessing tokens, identified in translation phase 3. Adjacent string literal tokens such as these are concatenated into a single token at translation phase 6. All (remaining) preprocessing tokens are converted to straight-up tokens at translation phase 7. The resulting tokens are then the input to the syntactic and semantic analysis (also part of phase 7).
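The split between phase 3 and phase 6 can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not how any particular compiler does it: `take_literal` and `concat_adjacent_literals` are hypothetical names, escape sequences are copied through untranslated, and the input is assumed to start at a run of string literals separated only by white space.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Phase-3-like step (hypothetical helper): if *p points at a string
 * literal, append its body (without the quotes) to out and advance *p
 * past the closing quote. Returns 1 on success, 0 if no complete literal
 * starts at *p or out would overflow.
 */
static int take_literal(const char **p, char *out, size_t outsz)
{
    const char *s = *p;
    size_t start = strlen(out);
    size_t n = start;

    if (*s != '"')
        return 0;
    s++;
    while (*s != '"') {
        int take = (*s == '\\') ? 2 : 1;    /* an escape is two characters */

        while (take-- > 0) {
            if (*s == '\0' || *s == '\n' || n + 1 >= outsz) {
                out[start] = '\0';          /* undo the partial copy */
                return 0;
            }
            out[n++] = *s++;
        }
    }
    out[n] = '\0';
    *p = s + 1;
    return 1;
}

/*
 * Phase-6-like step: merge a run of adjacent string-literal tokens,
 * separated only by white space, into out. Returns how many separate
 * tokens were merged.
 */
size_t concat_adjacent_literals(const char *src, char *out, size_t outsz)
{
    size_t count = 0;

    assert(src != NULL && out != NULL && outsz > 0);
    out[0] = '\0';
    for (;;) {
        while (isspace((unsigned char)*src))
            src++;
        if (!take_literal(&src, out, outsz))
            break;
        count++;
    }
    return count;
}
```

For the question's input, the two literals are recognized one at a time, so the function reports 2 tokens, while the merged buffer holds `Hello World`. A real implementation would more likely keep a token list and merge neighbouring entries, but the observable result is the same.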

C does not require implementations to actually perform translation (compilation) according to the given model, with all its separate phases, and many do not. C just requires that the end result be as if the implementation behaved according to the model. In that sense, your question can only be answered "it depends". As far as a non-C-specific conceptualization of the inferred question "what is a token" goes, however, I will maintain that my original, short description provides a useful mental model.

John Bollinger
  • Thank you. It'd be helpful if you could give me a link to those `translation phases` you are referring to so I can read up on them :) – Pruthvi Raj Oct 02 '15 at 18:04
  • @PruthviRaj, I gave you a citation to the appropriate section of the (C99) standard. You can find information about where to get the referenced document here: http://stackoverflow.com/questions/81656/where-do-i-find-the-current-c-or-c-standard-documents – John Bollinger Oct 02 '15 at 18:06