Why doesn't "between (char '"') (char '"') (many charLiteral)" work for parsing string literals?

Question

The documentation for Text.Megaparsec.Char.Lexer.charLiteral suggests using char '"' *> manyTill charLiteral (char '"') for parsing string literals (where manyTill is defined in the module Control.Applicative.Combinators in the parser-combinators library).

However, Control.Applicative.Combinators also defines between, which -- as far as I can see -- should do the same as the above suggestion when used like so: between (char '"') (char '"') (many charLiteral).

However, using the between parser above does not work for parsing string literals -- failing with "unexpected end of input. expecting '"' or literal character" (indicating that the ending quote is never detected). Why not?

Also, more generally, why isn't between pBegin pEnd (many p) equivalent to pBegin *> manyTill p pEnd?

Side note: this would probably work fine in parsers that backtrack by default, which is basically all of them other than (mega)parsec. — Joseph Sible-Reinstate Monica, May 06 '20 at 14:26
@JosephSible-ReinstateMonica no, I don't think backtracking can help you here. That would be if `charLiteral` actually failed on `"` (but did still consume the character, so control wouldn't be given back to what comes after the `many`). But in this case, `many charLiteral` actually succeeds – it successfully consumes _all the input_. — Anyway, megaparsec _does_ auto-backtrack on simple chars and strings, just not on bigger combinators. — leftaroundabout, May 06 '20 at 16:32
@leftaroundabout Upon testing, it looks like this is more subtle than either of us thought. It works as I expected [with `regex-applicative`](https://pastebin.com/igC6NHB1) and [with `ReadP`](https://pastebin.com/ihjZsTAX) but fails as you expected [with `attoparsec`](https://pastebin.com/wkYrigQF). I'm confused because I thought `attoparsec` did backtrack like those others. — Joseph Sible-Reinstate Monica, May 06 '20 at 17:30

score 7 · Accepted Answer · answered May 06 '20 at 09:19

between l r m doesn't do anything spectacular, it really just tries l then m then r and gives back the result of m. So, in between (char '"') (char '"') (many charLiteral), the many charLiteral doesn't know it's not supposed to consume the ". The many just keeps consuming whatever its argument parser accepts... which, because charLiteral just accepts anything, means it churns right through everything until the end of the input. The second char '"' has no way of stopping this, it just needs to make do with what's left... i.e., fail because there is nothing left!

By contrast, manyTill actually checks whether the “till”, matches, and only applies each iteration of the content parser when it doesn't. Therefore, the terminating " is not passed to charLiteral, and you get the desired behaviour.

Thank you. Would it be correct to sum it up by saying that `between l r (many p)` and `l *> manyTill p r` are equivalent iff `p` does not consume what `r` consumes? — runeks, May 08 '20 at 12:57
I think yes, but after Joseph's comments I'm not so sure anymore I understand what's going on. — leftaroundabout, May 08 '20 at 14:03

Why doesn't "between (char '"') (char '"') (many charLiteral)" work for parsing string literals?

1 Answers1