For a small project, we are implementing a compiler for a subset of C in Haskell, using megaparsec. Overall we have made good progress, but there are still some corner cases that we cannot handle correctly yet. One of them is the treatment of a backslash followed by a newline. To quote from the specification:
Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. (§5.1.1.2, ISO/IEC 9899:201x)
So far we have come up with two possible approaches to this problem:
1.) Implement a pre-lexing phase in which the initial input is copied and every occurrence of `\\\n` is removed. The big disadvantage we see in this approach is that we lose the accurate error locations that we need.
2.) Implement a special `char'` combinator that behaves like `char` but looks one extra character ahead and silently consumes any `\\\n`. This would give us correct positions. The disadvantage here is that we need to replace every occurrence of `char` with `char'` in every parser, even in the megaparsec-provided ones like `string`, `integer`, `whitespace`, etc.
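To make approach 1 concrete, here is a minimal sketch of the pre-lexing pass, using only `base`. The names `removeSplices` and `exampleSource` are hypothetical and chosen for illustration; they are not part of megaparsec. The sketch assumes the input is a plain `String` with `\n` line endings.

```haskell
-- Hypothetical pre-lexing pass (approach 1): delete every backslash that is
-- immediately followed by a newline, together with that newline.
-- A backslash that is not directly followed by '\n' is left untouched, so
-- only the last backslash on a physical line can take part in a splice.
removeSplices :: String -> String
removeSplices ('\\' : '\n' : rest) = removeSplices rest
removeSplices (c : rest)           = c : removeSplices rest
removeSplices []                   = []

-- Illustrative input: the identifier "ab" is split across two physical lines.
exampleSource :: String
exampleSource = "int a\\\nb = 1;\nint c;\n"
```

Here `removeSplices exampleSource` yields `"int ab = 1;\nint c;\n"`. The declaration of `c` sits on physical line 3 of the original input but on line 2 of the spliced text, so a parser run on the result would report it one line too early. That is exactly the loss of location information described in approach 1.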
Most likely we are not the first people trying to parse a language with such a "quirk" with parsec/megaparsec, so I could imagine that there is some nicer way to do it. Does anyone have an idea?