5

Comments and escape sequences (such as those in string literals) are quite exceptional compared with ordinary symbols that have a regular representation.

It's hard for me to understand how regular lexical analyzers tokenize them. How do lexical analyzers such as lex or flex handle these kinds of symbols? Is there a generic method, or is it handled case by case for each language?

eonil

3 Answers

1

Comments and escape sequences (such as those in string literals) are quite exceptional compared with ordinary symbols that have a regular representation.

I’m not sure what you mean, but this statement is certainly wrong. Both comments (unless they may be nested) and strings with escape sequences admit a simple regular language description.

For example, an escape sequence allowing \\, \", \n and \r can be described by the following regular grammar (with start symbol E):

E -> \ S
S -> \
S -> "
S -> n
S -> r
…

And a string is just a repetition of zero or more unescaped symbols or escape sequences (i.e. a Kleene closure over the union of two regular languages, which is itself regular).
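As a rough illustration of that description, here is a minimal Python sketch. It uses Python's `re` module rather than lex, and the names `STRING` and `match_string` are made up for this example. It assumes strings are delimited by double quotes and allow only the four escapes listed above.

```python
import re

# Hypothetical pattern (not taken from lex): an opening quote, then zero or
# more items that are either an ordinary character or one of the escape
# sequences \\, \", \n, \r, then a closing quote.
STRING = re.compile(r'"(?:[^"\\\n]|\\[\\"nr])*"')

def match_string(text, pos=0):
    """Return the string literal that starts at pos, or None if there is none."""
    m = STRING.match(text, pos)
    return m.group(0) if m else None

print(match_string(r'"a \n b" rest'))      # the whole literal, as one token
print(match_string(r'"say \"hi\"" rest'))  # escaped quotes do not end the string
print(match_string(r'"bad \q"'))           # None: \q is not an allowed escape
```

Because the whole literal is one regular expression, a lexer built this way can emit it as a single token rather than one token per escape.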

Konrad Rudolph
  • It's hard to understand for me how to tokenize them.. This is the meaning of the words :) – eonil Mar 06 '11 at 11:47
  • @Eonil How to tokenize them manually or how lex does it? Either way, it does it the same way that it tokenizes the rest of the input. Every introduction to regular languages and compiler construction contains the necessary know-how; you should get a good book. – Konrad Rudolph Mar 06 '11 at 11:50
  • Does that mean each escape in a string literal will be split into several tokens at the tokenization stage? e.g. "a \n b" -> {", a, (whitespace), \n, (whitespace), b, "} – eonil Mar 06 '11 at 11:56
  • @Konrad Could you recommend some books? – eonil Mar 06 '11 at 11:57
  • @Eonil Book recommendations can be found here: http://stackoverflow.com/q/1669/1968. About your other question: that depends on how you define the rules. I don’t really know `lex` but I think that `lex` will definitely let you output several tokens *or* just one token for the whole string, depending on what you need. – Konrad Rudolph Mar 06 '11 at 12:01
  • @Konrad Thanks a lot. I'll check them out! – eonil Mar 06 '11 at 12:05
1

I can't say anything about lex, but in the lexer for my own language (which uses C++-style // comments) the input has already been split into lines, since it's a Python-inspired language. I then have a regex that matches the // followed by any number of any characters.
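A minimal sketch of that idea in Python, not the answerer's actual code: the names `LINE_COMMENT` and `strip_comment` are hypothetical, and the sketch assumes that `//` never appears inside a string literal on the line.

```python
import re

# "//" followed by any number of any characters. "." does not match a
# newline, so the match always stops at the end of the line.
LINE_COMMENT = re.compile(r'//.*')

def strip_comment(line):
    """Remove a trailing // comment from one already-split input line."""
    return LINE_COMMENT.sub('', line).rstrip()

print(strip_comment('x = 1 // set x'))
print(strip_comment('y = 2'))
```

If string literals may contain `//`, the string rule has to be tried before the comment rule, which is the kind of context dependence the other answers discuss.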

PrettyPrincessKitty FS
1

I think it is indeed handled case by case for each language.
If a comment starter appears inside a string literal, the lexer has to ignore it. Similarly, in C, if an escaped double quote \" appears inside a string literal, the lexer must not treat it as the end of the string.
For this purpose, flex has start conditions, which enable context-dependent analysis.
For instance, the flex texinfo manual has an example for analyzing C comments (between /* and */):

%x IN_COMMENT
%%
<INITIAL>"/*"   BEGIN(IN_COMMENT);
<IN_COMMENT>{
"*/"            BEGIN(INITIAL);
[^*\n]+         /* eat comment in chunks */
"*"             /* eat the lone star */
\n              yylineno++;
}

Start conditions also enable string literal analysis. The flex texinfo manual has an example of matching C-style quoted strings using start conditions in the section Start Conditions, and a FAQ item titled How do I expand backslash-escape sequences in C-style quoted strings?
These will probably answer your question about string literals directly.
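To make the start-condition idea concrete outside flex, here is a small hand-written Python sketch of the same two-state scheme. It is not generated by flex or taken from it. The function name `strip_c_comments` is hypothetical, and the sketch deliberately ignores string literals, so a /* inside a string would be mistaken for a comment.

```python
def strip_c_comments(src):
    """Return (src without /* ... */ comments, number of newlines seen).

    The two states mirror the INITIAL and IN_COMMENT start conditions.
    """
    INITIAL, IN_COMMENT = "INITIAL", "IN_COMMENT"
    state = INITIAL
    out = []
    lines = 0
    i = 0
    while i < len(src):
        if src[i] == "\n":
            lines += 1                  # like yylineno++ in the flex rule
        if state == INITIAL:
            if src.startswith("/*", i):
                state = IN_COMMENT      # like BEGIN(IN_COMMENT)
                i += 2
                continue
            out.append(src[i])          # ordinary input is kept
        else:
            if src.startswith("*/", i):
                state = INITIAL         # like BEGIN(INITIAL)
                i += 2
                continue
            # anything else inside the comment is discarded
        i += 1
    return "".join(out), lines

print(strip_c_comments("a /* x\ny */ b\n"))
```

Handling string literals would mean adding a third state that is entered on a double quote and that treats a backslash as escaping the next character.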

Lesmana
Ise Wisteria