Good practice to parse data in a custom format

Question

I'm writing a program that takes in input a straight play in a custom format and then performs some analysis on it (like number of lines and words for each character). It's just for fun, and a pretext for learning cool stuff. The first step in that process is writing a parser for that format. It goes :

####Play
###Act I
##Scene 1
CHARACTER 1. Line 1, he's saying some stuff.
#Comment, stage direction
CHARACTER 2, doing some stuff. Line 2, she's saying some stuff too.

It's quite a simple format. I read extensively about basic parser stuff like CFG, so I am now ready to get some work done.

I have written my grammar in EBNF and started playing with flex/bison but it raises some questions :

Is flex/bison too much for such a simple parser ? Should I just write it myself as described here : Is there an alternative for flex/bison that is usable on 8-bit embedded systems? ?
What is good practice regarding the respective tasks of the tokenizer and the parser itself ? There is never a single solution, and for such a simple language they often overlap. This is especially true for flex/bison, where flex can perform some intense stuff with regex matching. For example, should "#" be a token ? Should "####" be a token too ? Should I create types that carry semantic information so I can directly identify for example a character ? Or should I just process it with flex the simplest way then let the grammar defined in bison decide what is what ?
With flex/bison, does it makes sense to perform the analysis while parsing or is it more elegant to parse first, then operate on the file again with some other tool ?

This got me really confused. I am looking for an elegant, perhaps simple solution. Any guideline ?

By the way, about the programing language, I don't care much. For now I am using C because of flex/bison but feel free to advise me on anything more practical as long as it is a widely used language.

By the way, if anyone wondered, this is the [Shakespeare Programming Language](http://en.wikipedia.org/wiki/Shakespeare_%28programming_language%29) for which there is a compiler: [http://shakespearelang.sourceforge.net/](http://shakespearelang.sourceforge.net/) — Brian Tompsett - 汤莱恩, Jan 22 '15 at 15:45

score 2 · Accepted Answer · answered Oct 22 '14 at 16:44

It's very difficult to answer those questions without knowing what your parsing expectations are. That is, an example of a few lines of text does not provide a clear understanding of what the intended parse is; what the lexical and syntactic units are; what relationships you would like to extract; and so on.

However, a rough guess might be that you intend to produce a nested parse, where ##{i} indicates the nesting level (inversely), with i≥1, since a single # is not structural. That violates one principle of language design ("don't make the user count things which the computer could count more accurately"), which might suggest a structure more like:

@play {
@act {
@scene {
@location: Elsinore. A platform before the castle.
@direction: FRANCISCO at his post. Enter to him BERNARDO 
BERNARDO: Who's there?
FRANCISCO: Nay, answer me: stand, and unfold yourself.
BERNARDO: Long live the king!
FRANCISCO: Bernardo?

or even something XML-like. But that would be a different language :)

The problem with parsing either of these with a classic scanner/parser combination is that the lexical structure is inconsistent; the first token on a line is special, but most of the file consists of unparsed text. That will almost inevitably lead to spreading syntactic information between the scanner and the parser, because the scanner needs to know the syntactic context in order to decide whether or not it is scanning raw text.

You might be able to avoid that issue. For example, you might require that a continuation line start with whitespace, so that every line not otherwise marked with #'s starts with the name of a character. That would be more reliable than recognizing a dialogue line just because it starts with the name of a character and a period, since it is quite possible for a character's name to be used in dialogue, even at the end of a sentence (which consequently might be the first word in a continuation line.)

If you do intend for dialogue lines to be distinguished by the fact that they start with a character name and some punctuation then you will definitely have to give the scanner access to the character list (as a sort of symbol table), which is a well-known but not particularly respected hack.

Consider the above a reflection about your second question ("What are the roles of the scanner and the parser?"), which does not qualify as an answer but hopefully is at least food for thought. As to your other questions, and recognizing that all of this is opinionated:

Is flex/bison too much for such a simple parser ? Should I just write it myself...

The fact that flex and bison are (potentially) more powerful than necessary to parse a particular language is a red herring. C is more powerful than necessary to write a factorial function -- you could easily do it in assembler -- but writing a factorial function is a good exercise in learning C. Similarly, if you want to learn how to write parsers, it's a good idea to start with a simple language; obviously, that's not going to exercise every option in the parser/scanner generators, but it will get you started. The question really is whether the language you're designing is appropriate for this style of parsing, not whether it is too simple.

With flex/bison, does it makes sense to perform the analysis while parsing or is it more elegant to parse first, then operate on the file again with some other tool?

Either can be elegant, or disastrous; elegance has more to do with how you structure your thinking about the problem at hand. Having said that, it is often better to build a semantic structure (commonly referred to as an AST -- abstract syntax tree) during the parse phase and then analyse that structure using other functions.

Rescanning the input file is very unlikely to be either elegant or effective.

Good practice to parse data in a custom format

1 Answers1