Is there an established way to write parsers that can reconstruct their exact input?

Question

Say I want to parse a file in language X. Really, I'm only interested in a small part of the information within. It's easy enough to write a parser in one of Haskell's many eDSLs for that purpose (e.g. Megaparsec).

data Foo = Foo Int  -- the information I'm after.

parseFoo :: Parsec Text Foo
parseFoo = ...

That readily gives rise to a function getFoo :: Text -> Maybe Foo.

But now I would also like to modify the source of the Foo information, i.e. basically I want to implement

changeFoo :: (Foo -> Foo) -> Text -> Text

with the properties

changeFoo id ≡ id
getFoo . changeFoo f ≡ fmap f . getFoo

It's possible to do that by changing the result of the parser to something like a lens

parseFoo :: Parsec Text (Foo, Foo -> Text)
parseFoo = ...

but that makes the definition a lot more cumbersome – I can't just gloss over irrelevant information anymore, but need to store the match of every string subparse and manually reassemble it.

This could be somewhat automated by keeping the string-reassembage in a StateT layer around the parser monad, but I couldn't just use the existing primitive parsers.

Is there an existing solution for this problem?

Parsing libraries generally construct an "Abstract Syntax Tree", which leaves out all formatting. What you are looking for is a "Concrete Syntax Tree". I don't know of any libraries which build this. — sapanoia, Sep 30 '16 at 20:50
I usually use [definite clause grammars](https://stackoverflow.com/a/44347314/975097) in Prolog for bidirectional parsing, but I haven't seen anything like this in Haskell. — Anderson Green, Jan 09 '19 at 22:21

d8d0d65b3f7cf42 · Answer 1 · 2016-05-30T17:11:56.600

3

Is this a case of "bidirectional transformation"? E.g., http://ceur-ws.org/Vol-1571/

In particular, "Invertible Syntax Descriptions: Unifying Parsing and Pretty Printing" by Rendel and Osterman http://dblp.org/rec/conf/haskell/RendelO10 , Haskell Symposium 2010 (cf. http://lambda-the-ultimate.org/node/4191 )

edited May 30 '16 at 17:11

answered May 30 '16 at 16:50

d8d0d65b3f7cf42

2,597
15
28

1

Yes, that's great work; but does it really adress my problem? I was under the impression that these invertible parsers mainly guarantee `parse . print ≡ id`. Can they also assure anything about `print . parse`? – leftaroundabout May 30 '16 at 16:57
If you're pretty-printing, then `print ( parse s) === s` modulo whitespace? It should be even easier to build basic printer/parsers for exact-printing. – d8d0d65b3f7cf42 May 30 '16 at 17:03
2

Well, the thing is, in my application _most_ of the file is “whitespace” (or rather “comments”), since I don't bother to actually parse most of the content in detail, only a few variables. – leftaroundabout May 30 '16 at 17:07
You need a whitespace parser that returns what it has read, and not ignores it? The "lens" model looks reasonable to me. Whether it's useful would depend on the combinators. I would not just blame the type... – d8d0d65b3f7cf42 May 30 '16 at 17:15

score 1 · Answer 2 · edited May 23 '17 at 12:00

A solution implemented in Haskell? I don't know of one; they may exist.

In general, though, one can store enough information to regenerate a legal version of the program that resembles the original to an arbitrary degree, by storing "formatting" information with collected tokens. In the limit, the format information is the original string for the token; any approximation of that will give successively less accurate answers.

If you keep whitespace as explicit tokens in the parse tree, in the limit you can even regenerate that. Whether that is useful likely depends on the application. In general, I think this is overkill.

Details on what/how to capture and how to regenerate can be found in my SO answer: Compiling an AST back to source code

Is there an established way to write parsers that can reconstruct their exact input?

2 Answers2

Linked