lexer/parser ambiguity

Question

How does a lexer solve this ambiguity?

/*/*/

How is it that it doesn't just say, oh yeah, that's the begining of a multi-line comment, followed by another multi-line comment.

Wouldn't a greedy lexer just return the following tokens?

/*
/*
/

I'm in the midst of writing a shift-reduce parser for CSS and yet this simple comment thing is in my way. You can read this question if you wan't some more background information.

UPDATE

Sorry for leaving this out in the first place. I'm planning to add extensions to the CSS language in this form /* @ func ( args, ... ) */ but I don't want to confuse an editor which understands CSS but not this extension comment of mine. That's why the lexer just can't ignore comments.

As noted in your "this question" response, the lexer should enter a "in comment" state and discard input until it sees a lexeme that moves it out of that state. The parser should never see comments, and the lexer shouldn't see the content of comments except to determine when they end. — msw, Apr 13 '10 at 23:45
@msw: Of course, having the parser never see comments is not a hard rule. You can do some pretty cool things by treating comments as tokens and giving them to the parser - just look at Python docstrings. — Brian McKenna, Apr 13 '10 at 23:55
Indeed, I was specifically referring to the C-style comments and their lexical relation to the grammar. I could have more clearly pointed to the OPs comment that he shouldn't confuse the lexical and syntactic interpretations. Agreed also that Python docstrings are useful (and javadoc, etc.). I've not looked at the Python grammar, but I'm betting there is a production for . — msw, Apr 14 '10 at 01:10
The thing is I really wanna feed the comments to the parser. — John Leidegren, Apr 14 '10 at 06:20

score 10 · Accepted Answer · edited Nov 17 '17 at 21:50

10

One way to do it is for the lexer to enter a different internal state on encountering the first /*. For example, flex calls these "start conditions" (matching C-style comments is one of the examples on that page).

edited Nov 17 '17 at 21:50

rici

234,347
28
237
341

answered Apr 13 '10 at 23:43

Matthew Slattery

45,290
8
103
119

I believe this is the right thing for me. As I left out the part that I really want to parse the comments... – John Leidegren Apr 14 '10 at 06:34
@John Leidegren: Make sure you understand inclusive and exclusive conditions. They are very handy indeed. – leppie Apr 14 '10 at 06:40
@leppie - in lex? I'm against using a tool right now. I might consider a lexer generator later, but I really wanna get into the bare metal of writing both a lexical analysis and parser. – John Leidegren Apr 14 '10 at 06:53
@John Leidegren: I know they are in Flex. But if you are writing your own, it is rather trivial to implement (via stacks and sets). – leppie Apr 14 '10 at 07:09

Amber · Answer 2 · 2010-04-13T23:48:41.347

6

The simplest way would probably be to lex the comment as one single token - that is, don't emit a "START COMMENT" token, but instead continue reading in input until you can emit a "COMMENT BLOCK" token that includes the entire /*(anything)*/ bit.

Since comments are not relevant to the actual parsing of executable code, it's fine for them to basically be stripped out by the lexer (or at least, clumped into a single token). You don't care about token matches within a comment.

edited Apr 13 '10 at 23:48

answered Apr 13 '10 at 23:42

Amber

507,862
82
626
550

The thing is that I'm going to add extensions in the form of `/* @ func ( args, ... ) */` that is why I just can't throw away comments. Also the CSS2 specification says that both / * are separate delimiter tokens. – John Leidegren Apr 14 '10 at 06:32
1

You could recursively parse comment tokens once they're identified, if you so chose - depending on how complex your extensions are, they might be able to be regex'd out of a flat comment string, instead of needing to be lexed. – Amber Apr 14 '10 at 06:46
Matching on the comment with regex does sound practical, doesn't feel very formal though. I'd like to avoid this for now. Right now I wanna focus on understanding the underpinnings of lexical analysis. – John Leidegren Apr 14 '10 at 06:55

mjv · Answer 3 · 2010-04-13T23:42:53.883

3

In most languages, this is not ambiguous: the first slash and asterix are consumed to produce the "start of multi-line comment" token. It is followed by a slash which is plain "content" within the comment and finally the last two characters are the "end of multi-line comment" token.

Since the first 2 characters are consumed, the first asterix cannot also be used to produce an end of comment token. I just noted that it could produce a second "start of comment" token... oops, that could be a problem, depending on the amount of context is available for the parser.

I speak here of tokens, assuming a parser-level handling of the comments. But the same applies to a lexer, whereby the underlying rule is to start with '/*' and then not stop till '*/' is found. Effectively, a lexer-level handling of the whole comment wouldn't be confused by the second "start of comment".

edited Apr 13 '10 at 23:42

answered Apr 13 '10 at 23:37

mjv

73,152
14
113
156

I think the ambiguity the OP is concerned about isn't the first `/*/`, but instead the second `/*/` - they're worried that their lexer would emit two "start comment" tokens and then have consumed 4 of the 5 characters, leaving only a `/` character (and thus no "end comment" token). I think my answer is a little more explicit about how to avoid this on the lexer level. – Amber Apr 13 '10 at 23:46
@Dav: right you are; I caught on to the issue of 2nd "start comment" token while editing, and too came to the conclusion that a lexer-level handling of the whole comment is the easiest. However, some applications need parse stuff within comments (ex: documentation building apps etc.),a nd for them, the parser-level needs to get some context (in some form or other) to get out of this pickle. – mjv Apr 13 '10 at 23:59

score 1 · Answer 4 · edited Jun 19 '15 at 12:13

1

Since CSS does not support nested comments, your example would typically parse into a single token, COMMENT. That is, the lexer would see /* as a start-comment marker and then consume everything up to and including a */ sequence.

edited Jun 19 '15 at 12:13

Bakuriu

98,325
22
197
231

answered Apr 13 '10 at 23:47

500 - Internal Server Error

28,327
8
59
66

1

You can use enclosing backticks (`\``) to mark only a span of text as code, instead of entire lines, like this: `/*` – Amber Apr 13 '10 at 23:50

Ming-Tang · Answer 5 · 2010-04-13T23:42:56.087

0

Use the regexp's algorithm, search from the beginning of the string working way back to the current location.

if (chars[currentLocation] == '/' and chars[currentLocation - 1] == '*') {
  for (int i = currentLocation - 2; i >= 0; i --) {
    if (chars[i] == '/' && chars[i + 1] == '*') {
      // .......
    }
  }
}

It's like applying the regexp /\*([^\*]|\*[^\/])\*/ greedy and bottom-up.

edited Apr 13 '10 at 23:42

answered Apr 13 '10 at 23:36

Ming-Tang

17,410
8
38
76

Cam · Answer 6 · 2010-04-14T00:07:49.157

One way to solve this would be to have your lexer return:

/
*
/
*
/

And have your parser deal with it from there. That's what I'd probably do for most programming languages, as the /'s and *'s can also be used for multiplication and other such things, which are all too complicated for the lexer to worry about. The lexer should really just be returning elementary symbols.

If what the token is starts to depend too much on context, what you're looking for may very well be a simpler token.

That being said, CSS is not a programming language so /'s and *'s can't be overloaded. Really afaik they can't be used for anything else other than comments. So I'd be very tempted to just pass the whole thing as a comment token unless you have a good reason not to: /\*.*\*/

lexer/parser ambiguity

UPDATE

6 Answers6