Elegant way to parse "line splices" (backslashes followed by a newline) in megaparsec

Question

For a small compiler project, we are implementing a compiler for a subset of C, using Haskell and megaparsec. Overall we have made good progress, but there are still some corner cases that we cannot handle correctly yet. One of them is the treatment of backslashes followed by a newline. To quote from the specification:

Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. (§5.1.1., ISO/IEC9899:201x)

So far we came up with two possible approaches to this problem:

1.) Implement a pre-lexing phase in which the initial input is copied and every occurrence of \\\n is removed. The big disadvantage we see in this approach is that we lose the accurate error locations, which we need.

2.) Implement a special char' combinator that behaves like char but looks one extra character ahead and silently consumes any \\\n. This would give us correct positions. The disadvantage here is that we would need to replace every occurrence of char with char' in every parser, even in the megaparsec-provided ones like string, integer, whitespace, etc.
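To make the first approach concrete, here is a minimal sketch that uses only base. It tags every character with its physical (line, column) position before removing the splices, so the positions survive the pre-lexing pass. The names `Pos`, `annotate` and `spliceLines` are made up for illustration; feeding the resulting list to megaparsec would additionally need a custom `Stream` instance, which is not shown here.

```haskell
module Main where

-- | A physical source position as (line, column), both 1-based.
--   Hypothetical helper type, not part of megaparsec.
type Pos = (Int, Int)

-- | Pair every character of the input with its physical position.
annotate :: String -> [(Pos, Char)]
annotate = go (1, 1)
  where
    go _ [] = []
    go p@(l, c) (x : xs) = (p, x) : go next xs
      where
        -- A newline moves to the start of the next line.
        next = if x == '\n' then (l + 1, 1) else (l, c + 1)

-- | Delete each backslash that is immediately followed by a newline.
--   The surviving characters keep their original positions.
spliceLines :: [(Pos, Char)] -> [(Pos, Char)]
spliceLines ((_, '\\') : (_, '\n') : rest) = spliceLines rest
spliceLines (x : rest) = x : spliceLines rest
spliceLines [] = []

main :: IO ()
main = do
  let src = "in\\\nt x;\n" -- "in", backslash, newline, "t x;", newline
      logical = spliceLines (annotate src)
  putStr (map snd logical)   -- prints: int x;
  print (fst (logical !! 2)) -- prints: (2,1), the physical position of 't'
```

With this representation the logical text is `int x;`, while the `t` still reports line 2, column 1 of the physical file.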

Most likely we are not the first people trying to parse a language with such a "quirk" using parsec/megaparsec, so I imagine there is a nicer way to do it. Does anyone have an idea?
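For reference, the second approach (the char' combinator) could look roughly like the sketch below. It assumes a reasonably recent megaparsec (version 6 or later) and a plain `String` input. The combinator is called `spliceChar` here because `Text.Megaparsec.Char` already exports a case-insensitive `char'`; `lineSplice` and `spliceString` are likewise made-up names. `spliceString` shows the drawback mentioned above: a replacement for `string` has to be rebuilt on top of the new combinator.

```haskell
module Main where

import Data.Void (Void)
import Text.Megaparsec (Parsec, parseMaybe, skipMany, try)
import Text.Megaparsec.Char (char, string)

type Parser = Parsec Void String

-- | A single line splice: a backslash immediately followed by a newline.
--   'try' makes sure a lone backslash is not consumed.
lineSplice :: Parser ()
lineSplice = () <$ try (string "\\\n")

-- | Like 'char', but silently skips any line splices in front of the
--   character. The outer 'try' restores the input if the character
--   does not match after some splices were already skipped.
spliceChar :: Char -> Parser Char
spliceChar c = try (skipMany lineSplice *> char c)

-- | A replacement for 'string' built from 'spliceChar'.
spliceString :: String -> Parser String
spliceString = traverse spliceChar

main :: IO ()
main = do
  print (parseMaybe (spliceString "int") "in\\\nt")      -- Just "int"
  print (parseMaybe (spliceString "int") "i\\\n\\\nnt")  -- Just "int"
  print (parseMaybe (spliceString "int") "in\\t")        -- Nothing
```

Because the splice characters are consumed by megaparsec itself, its own position tracking stays accurate. A splice directly before the end of a token or of the input is not handled by this sketch.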

Isn't this basically the same problem you're going to have with comments? There are characters in the input stream that you wish to pretend don't exist, but still need to be tracked for accurate line and column numbers. How do you handle comments? Can you handle these escape sequences in the same way? — amalloy, Nov 03 '17 at 06:39
@amalloy: It's not exactly the same. Comments are handled inside our `lexeme` combinator, which first runs another parser and then consumes all whitespace after the token. Note that comments can only appear between tokens, so handling them in `lexeme` is sufficient. In contrast, `\\\n` can appear anywhere, even inside a string literal or a decimal literal. — Chirs, Nov 03 '17 at 09:32
The C preprocessor is annoyingly resilient against functional refactoring. Keep in mind you'll probably eventually want to handle `#line` directives too. So I think maybe you'll need to pair your chars and identifiers up with some kind of `LocationSpan` type. — NovaDenizen, Nov 03 '17 at 14:50
Perhaps you could wrap `ParsecT` in a newtype and write a modified `MonadParsec` instance for it, one in which functions like `token` and `tokens` skipped line splices. That way you wouldn't need to modify "derived" combinators like `string`. — danidiaz, Nov 03 '17 at 19:39
@danidiaz, we haven't implemented it, as our instructor told us to skip this part of the specification, but I think this is the cleanest way to do it. Thank you for the suggestion! If you add this as an answer I will accept it. — Chirs, Nov 25 '17 at 14:25

0 Answers