How do lexical analyzers handle comment and escape sequences?

Question

Comment and escape sequence (such as string literal) are very exceptional from regular symbolic representation.

It's hard for me to understand how regular lexical analyzers tokenize them. How do lexical analyzers like lex or flex handle these kinds of symbols? Is there a generic method, or is it handled case by case for each language?

score 1 · Answer 1 · answered Mar 06 '11 at 11:43

Comment and escape sequence (such as string literal) are very exceptional from regular symbolic representation.

I’m not sure what you mean, but this statement is certainly wrong. Both comments (unless they may be nested) and strings with escape sequences admit a simple regular-language description.

For example, an escape sequence allowing \\, \", \n and \r can be described by the following regular grammar (with start symbol E):

E -> \ S
S -> \
S -> "
S -> n
S -> r
…

And a string is just a repetition of zero or more unescaped symbols or escape sequences (i.e. a Kleene closure over the union of two regular languages, which is itself regular).
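A minimal sketch of that description as a regular expression, using Python's `re` module rather than lex. The names `ESCAPE`, `UNESCAPED`, `STRING` and `match_string` are made up for illustration, and the choice to exclude raw newlines from strings is an assumption, not something the grammar above states.

```python
import re

# One escape sequence: a backslash followed by \, ", n or r
# (the grammar's E -> \ S, limited to the four escapes listed above).
ESCAPE = r'\\[\\"nr]'

# One unescaped symbol: anything except a quote, a backslash or a newline.
# Excluding newlines is an assumption made for this sketch.
UNESCAPED = r'[^"\\\n]'

# A string literal: opening quote, zero or more (unescaped | escape),
# closing quote. The (...)* part is the Kleene closure over the union.
STRING = re.compile(rf'"(?:{UNESCAPED}|{ESCAPE})*"')


def match_string(text, pos=0):
    """Return the string literal that starts at text[pos], or None."""
    m = STRING.match(text, pos)
    return m.group(0) if m else None
```

With this pattern the whole literal, escapes included, comes back as a single match; an unknown escape such as `\q` or a missing closing quote makes the match fail.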

Konrad Rudolph

It's hard for me to understand how to tokenize them. That is what I meant :) – eonil Mar 06 '11 at 11:47
@Eonil How to tokenize them manually or how lex does it? Either way, it does it the same way that it tokenizes the rest of the input. Every introduction to regular languages and compiler construction contains the necessary know-how; you should get a good book. – Konrad Rudolph Mar 06 '11 at 11:50
Does that mean each escape in a string literal will be split into several tokens at the tokenization stage? E.g. "a \n b" -> {", a, (whitespace), \n, (whitespace), b, "} – eonil Mar 06 '11 at 11:56
@Konrad Could you recommend some books? – eonil Mar 06 '11 at 11:57
@Eonil Book recommendations can be found here: http://stackoverflow.com/q/1669/1968. About your other question: that depends on how you define the rules. I don’t really know `lex`, but I think that `lex` will definitely let you output several tokens *or* just one token for the whole string, depending on what you need. – Konrad Rudolph Mar 06 '11 at 12:01
@Konrad Thanks a lot. I'll check them out! – eonil Mar 06 '11 at 12:05

score 1 · Answer 2 · answered Mar 06 '11 at 11:47

I can't speak for lex, but the lexer for my language (which uses C++-style // comments) has already split the input into lines by that point (it's a Python-inspired language), so I simply use a regex that matches // followed by any number of any characters.
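A rough sketch of that per-line approach in Python. The regex for the comment itself is just `//.*`; the extra string-literal alternative is an addition that is not described above, included so that a `//` inside a quoted string is not mistaken for a comment. `STRING_OR_COMMENT` and `split_comment` are hypothetical names.

```python
import re

# Either a double-quoted string (with backslash escapes) or a // comment
# running to the end of the line. The string alternative comes first so
# that "//" inside a string literal is consumed as part of the string.
STRING_OR_COMMENT = re.compile(r'"(?:[^"\\]|\\.)*"|//.*')


def split_comment(line):
    """Split one source line into (code, comment); comment may be None."""
    pos = 0
    while pos < len(line):
        m = STRING_OR_COMMENT.search(line, pos)
        if m is None:
            break
        if m.group(0).startswith('//'):
            # Found a real comment: everything before it is code.
            return line[:m.start()], m.group(0)
        # It was a string literal; keep scanning after it.
        pos = m.end()
    return line, None
```

For example, `split_comment('x = 1 // set x')` separates the code from the comment, while a URL inside a string such as `"http://a"` is left alone.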

PrettyPrincessKitty FS

If you mean tokenizing right, I would check against a `\` or a `commentsymbolhere`, then consume the rest. – PrettyPrincessKitty FS Mar 06 '11 at 11:59

score 1 · Accepted Answer · edited May 06 '13 at 14:35

I think this - case by case for each language - is true.
If a comment starter appears inside a string literal, the lexer has to ignore it. Similarly, in C, if an escaped double quote \" appears inside a string literal, the lexer must not treat it as the end of the string.
For this purpose, flex has start conditions, which enable contextual analysis.
For instance, the flex texinfo manual has an example for analyzing C comments (between /* and */):

%x IN_COMMENT
%%
<INITIAL>"/*"   BEGIN(IN_COMMENT);
<IN_COMMENT>{
"*/"            BEGIN(INITIAL);
[^*\n]+         /* eat comment in chunks */
"*"             /* eat the lone star */
\n              yylineno++;
}
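To show the idea outside flex, here is a small Python sketch that imitates start conditions with one rule list per state. It is only an approximation: flex picks the longest match, while this sketch takes the first rule that matches, and the names `RULES` and `strip_block_comments` are invented for the example. The string-literal rule in the INITIAL state is an addition that illustrates why a comment starter inside a string is ignored.

```python
import re

# Per-state rules: (pattern, state to switch to or None, keep matched text?)
RULES = {
    "INITIAL": [
        (re.compile(r'/\*'), "IN_COMMENT", False),         # comment opener
        (re.compile(r'"(?:[^"\\\n]|\\.)*"'), None, True),  # whole string literal
        (re.compile(r'[^/"]+'), None, True),               # ordinary text in chunks
        (re.compile(r'.', re.S), None, True),              # lone / or lone "
    ],
    "IN_COMMENT": [
        (re.compile(r'\*/'), "INITIAL", False),            # comment closer
        (re.compile(r'[^*\n]+'), None, False),             # eat comment in chunks
        (re.compile(r'\*'), None, False),                  # eat the lone star
        (re.compile(r'\n'), None, False),                  # newline inside comment
    ],
}


def strip_block_comments(source):
    """Return source with /* ... */ comments removed."""
    state, pos, kept = "INITIAL", 0, []
    while pos < len(source):
        for pattern, next_state, keep in RULES[state]:
            m = pattern.match(source, pos)
            if m:
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
        if keep:
            kept.append(m.group(0))
        if next_state is not None:
            state = next_state
        pos = m.end()
    return "".join(kept)
```

Because the string rule is only active in the INITIAL state and consumes the whole literal, a `/*` inside quotes never reaches the comment-opener rule.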

Start conditions also enable string literal analysis. The Start Conditions section of the flex texinfo manual has an example of matching C-style quoted strings with start conditions, and the manual's FAQ has an item titled "How do I expand backslash-escape sequences in C-style quoted strings?".
This will probably answer your question about string literals directly.
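The same technique can be sketched in Python as a hand-written scanner that enters a "string" mode at the opening quote, expands backslash escapes as it goes, and leaves the mode at the closing quote. This is not the manual's code; `ESCAPES` and `scan_string` are hypothetical names, and the set of supported escapes is an assumption.

```python
# Escapes this sketch understands (an assumed, deliberately small set).
ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "\\": "\\", '"': '"'}


def scan_string(source, pos):
    """Scan a quoted string starting at source[pos].

    Returns (decoded value, position just after the closing quote).
    Raises ValueError on a missing quote, unknown escape or raw newline.
    """
    if pos >= len(source) or source[pos] != '"':
        raise ValueError("expected opening quote")
    pos += 1
    buf = []
    while pos < len(source):
        ch = source[pos]
        if ch == '"':
            # Closing quote: leave "string mode" and return one token.
            return "".join(buf), pos + 1
        if ch == "\n":
            raise ValueError("unterminated string")
        if ch == "\\":
            if pos + 1 >= len(source):
                break
            esc = source[pos + 1]
            if esc not in ESCAPES:
                raise ValueError(f"unknown escape \\{esc}")
            buf.append(ESCAPES[esc])
            pos += 2
        else:
            buf.append(ch)
            pos += 1
    raise ValueError("unterminated string")
```

Here the whole literal becomes one token whose value already has the escapes expanded, so an escaped `\"` never ends the string early.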

Thanks. Now I can be sure that it's really case-by-case (no universal solution) :) – eonil Mar 07 '11 at 01:24
