
I need to parse a simple proprietary language that I didn't design, so I can't change the language. I need the results in C#, so I've been using TinyPG because it's so easy to use, and doesn't require external libraries to run the parser. TinyPG generates a simple LL(1) parser.

The problem I'm currently having is related to how the language divides the file into sections. It has sections for different kinds of variables, setting their initial values, equation definitions, etc. I only care about sections that declare variables, so I would like to just ignore the rest. I don't know all the rules for the other sections, and don't want to have to figure them out. They may be treated as comments.

Here's a code example:

PARAMETER
    Density             AS REAL
    CrossSectionalArea  AS REAL

SET # Parameter values
    T101.FO                 := "SimpleEventFOI::dummy";
    T101.CrossSectionalArea := 1    ; # m2

EQUATION
    OutSingleInt = SingleInt;
    OutArrayInt = ArrayInt;

I care about the PARAMETER and SET sections, but not the EQUATION section. As you can see, the problem is that these sections have no END markers. So I can't figure out how to tell the grammar that a section ends when you get a different keyword, but that the new keyword may start a new section. In my attempts the new section starting keyword gets consumed to close off the old section.

There are many more sections than I have listed here, some of which I care about, some I don't. They seem to fall into two types: "Looks like PARAMETER", which don't have semicolons at the end of the statements, and "Looks like EQUATION", which do. This language is neither case sensitive nor whitespace sensitive. The sections can appear in any order (e.g. SET, EQUATION, PARAMETER). Aside from comments, the whole thing could be written on one line.

Currently I'm getting around this by using a regular expression to find the sections I'm interested in, and only feeding those to the parser. However, I'm having trouble coming up with a regular expression that works in all cases and doesn't accidentally pick up keywords in comments. I may end up just expanding this workaround to solve its issues, but it would be nicer to solve the problem directly in the grammar. It's possible this just isn't an LL(1) language.
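To make the workaround concrete, here is a minimal sketch of a pre-pass that walks the text once instead of relying on a single regular expression. It assumes that section keywords are whole words, that `#` starts a comment running to the end of the line, and that string literals are delimited by double quotes with no escapes. The class name `SectionSplitter` and the three-entry keyword list are hypothetical; the real list would need every section name in the language.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of TinyPG: splits source text into
// (keyword, body) pairs so that only selected bodies reach the parser.
public static class SectionSplitter
{
    // Assumed keyword list; the comparer makes the lookup case-insensitive.
    static readonly HashSet<string> Keywords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "PARAMETER", "SET", "EQUATION" };

    public static List<(string Keyword, string Body)> Split(string source)
    {
        var sections = new List<(string Keyword, string Body)>();
        string current = "";            // keyword of the section being read
        var body = new StringBuilder();
        int i = 0;

        while (i < source.Length)
        {
            char c = source[i];
            if (c == '#')
            {
                // Comment: drop everything up to (not including) the newline.
                while (i < source.Length && source[i] != '\n') i++;
            }
            else if (c == '"')
            {
                // String literal: copy verbatim so keywords inside it are ignored.
                int end = source.IndexOf('"', i + 1);
                if (end < 0) end = source.Length - 1;   // unterminated string
                body.Append(source, i, end - i + 1);
                i = end + 1;
            }
            else if (char.IsLetter(c) || c == '_')
            {
                // Read a whole word, then decide whether it opens a section.
                int start = i;
                while (i < source.Length &&
                       (char.IsLetterOrDigit(source[i]) || source[i] == '_')) i++;
                string word = source.Substring(start, i - start);

                if (Keywords.Contains(word))
                {
                    // A keyword closes the previous section and opens a new one.
                    if (current != "") sections.Add((current, body.ToString().Trim()));
                    current = word.ToUpperInvariant();
                    body.Clear();
                }
                else
                {
                    body.Append(word);
                }
            }
            else
            {
                body.Append(c);
                i++;
            }
        }

        if (current != "") sections.Add((current, body.ToString().Trim()));
        return sections;
    }
}
```

Calling `SectionSplitter.Split(text)` returns the sections in file order, so the PARAMETER and SET bodies can be handed to the generated parser and the rest discarded. One known limitation of this sketch: a member name that happens to equal a keyword (for example a hypothetical `T101.SET`) would be mistaken for a section start.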

Jim
  • In your example, it looks like "SET" is the end marker for "PARAMETER"; and that "EQUATION" is the end marker for "SET" and that <> is the end marker for "EQUATION". To be sure, you'd need to start from a specification of the language, not an example... it is important you don't come to rely upon some syntactic coincidence in your sample data. [BTW: It definitely looks 'LL(1)' - perhaps one could even argue that it is in the regular subset of LL(1)] – aSteve Jun 04 '15 at 08:55
  • @aSteve But perhaps they are optional... Could you build a valid file without the SET section? – xanatos Jun 04 '15 at 09:03
  • Without a specification for the grammar of the file you want to parse, you can never know the answer to your question. You could guess, and you might get lucky - no guarantees. The specification would not necessarily be BNF - for example the source code for the program which writes all the files to be parsed would suffice... or there may be adequate details in informal documentation. An example is not a specification. :) – aSteve Jun 04 '15 at 09:09
  • Is the language whitespace sensitive - like the preprocessors in python or haskell? Then you could match the end of the block to the next line that has less indentation than the former? – mfeineis Jun 04 '15 at 09:13
  • @aSteve The sections can be in any order. EQUATION could come first or second, for example. I don't have a specification, I just have a bunch of examples, which would be too much to post here. – Jim Jun 04 '15 at 09:32
  • @vanhelgen The language is not whitespace sensitive. Aside from the fact that # comments go until the end of the line. – Jim Jun 04 '15 at 09:34
  • If you have no specification for what you want to match, you need to derive an acceptable guess from all the examples you have. Without first establishing such a specification, it is impossible to say what is relevant about the section headings... Perhaps what matters is that they're the only lines with a single identifier; perhaps it matters that the identifier is not indented; perhaps only a few identifiers count; perhaps it matters that the identifier is all-capitals. Without comprehensive examples no-one else can even take a sensible guess about what you actually want. – aSteve Jun 04 '15 at 14:11
  • @aSteve Yeah, I seem to have underspecified the problem. I added a paragraph to try to clear up your questions. Let me know if there are more questions. – Jim Jun 05 '15 at 00:37

1 Answer


I tried the following TinyPG grammar, and it parses your example. It looks like TinyPG cannot distinguish keywords from identifiers, so I hacked the ID token a little: a negative lookahead stops it from matching the section keywords.

//Tiny Parser Generator v1.3
//Copyright © Herre Kuijpers 2008-2012

<% @TinyPG Namespace="Test" %>

PARAMETER   -> @"PARAMETER";
SET         -> @"SET";
EQUATION    -> @"EQUATION";

AS          -> @"AS";

ID          -> @"\b(?!(PARAMETER|SET|EQUATION)\b)([a-zA-Z]\w*)";
DOT         -> @"\.";
EQ          -> @":=";
EXPR        -> @"\d+|""[^""]*""";
END         -> @";";

[Skip] WS   -> @"\s+|#[^\r\n]*";

EQDECL      -> @"\b(?!(PARAMETER|SET|EQUATION)\b)([^#;]+)";
Equations   -> EQUATION (EQDECL END)*;

Parameters  -> PARAMETER ParamDecl*;
ParamDecl   -> ID AS ID;

Sets        -> SET SetDecl*;
SetDecl     -> FullId EQ EXPR END;
FullId      -> ID DOT ID;

Section     -> Equations | Parameters | Sets;

Start       -> Section*;
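
The keyword guard in the ID token can be checked on its own with plain .NET `Regex`, independent of TinyPG. The sketch below uses a pattern modelled on the ID token above; the class name `KeywordGuard` and the method `IsId` are hypothetical. It also shows why case matters: the question says the language is not case sensitive, and without `RegexOptions.IgnoreCase` a lowercase `set` is accepted as an ordinary identifier.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for experimenting with the keyword guard outside TinyPG.
public static class KeywordGuard
{
    // Modelled on the ID token: a word that is not exactly a section keyword.
    const string IdPattern = @"\b(?!(PARAMETER|SET|EQUATION)\b)[a-zA-Z]\w*";

    // True when the whole text is accepted as a single identifier.
    public static bool IsId(string text, bool ignoreCase)
    {
        RegexOptions options = ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None;
        Match m = Regex.Match(text, IdPattern, options);
        return m.Success && m.Index == 0 && m.Length == text.Length;
    }
}
```

With this helper, `IsId("Density", false)` and `IsId("SetPoint", true)` are true, `IsId("SET", false)` is false, and `IsId("set", false)` is true while `IsId("set", true)` is false. Assuming TinyPG hands its token patterns to .NET `Regex` unchanged, prefixing the keyword and ID patterns with the inline option `(?i)` should give the grammar the same case-insensitive behaviour; that is an assumption to verify against the generated scanner.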
Paul Chen
  • Thanks for this, but section order is not guaranteed. This example does not work if you mix up the sections a bit. (I should've done this in my example.) So, if the sections are PARAMETER EQUATION SET tpg will not parse it correctly. It will claim to parse it, but SET is all sucked up by EQUATION. – Jim Jun 04 '15 at 09:30
  • Ah, if each equation ends with ';' and if the parameter declarations are always the first section, it's possible to modify this to make it work. – Paul Chen Jun 04 '15 at 09:43
  • There's an idea. The sections I really don't want to parse all seem to have statements ending in either ';' or "END" – Jim Jun 04 '15 at 09:57
  • I edited the answer and relaxed the rules to allow sections in mixed order. @Jim – Paul Chen Jun 04 '15 at 10:51