4

I am attempting to create a text parser that will allow limited user defined substitution rules.

To wit I am reading in codes from a DOS ASCII file in which the ordering is significant and the line numbering must be maintained. With this input I want to apply user defined substitution rules (exchange this string for that string, if we see this string followed by that string perform this translation, etc).

The output is also a formatted DOS ASCII file.

Most of the rules are straight forward replace tit for tat type replacements, however, there are situations where I want to define a rule like if A is followed by B at any point in the future, apply this rule.

To do this I'm using a tree of structs as such:

struct node {
    list<string> common;  // the text which is not affected by conditions
    string condition;     // matching this string selects the left, otherwise the right
    node *lptr, *rptr;    // pointers to the child nodes, if needed
};

Any time I encounter such a rule I can maintain the output with the rule both omitted and applied, delaying the decision of which to use until it's resolved unambiguously.

It's somewhat memory wasteful, but seems the best way to avoid having to pass over the input data twice (the size of the input data is unknown but probably less than 1 meg).

Of course such a case might exist that a different rule of this type triggers within one or both of the child nodes, so that's why a tree structure.

There is no restriction that the children must be decided before their parents, it may be that the parent is only decidable on one branch of the child. Encountering EOF would decide any undecided children in the false direction.

So it's clear I must be careful when rewinding and collapsing nodes.

Is there a simpler solution to this general problem? Is there a way to use standard library containers in a more efficient way than my tree presents?

Stephen
  • 1,607
  • 2
  • 18
  • 40

3 Answers3

0

You may want to look at NFA and DFA, namely nondeterministic finite-state automa and deterministic finite-state automa. These two approaches are the most common and often very effective way to writer parsers.

There is actually no need to store the data in the node, otherwise it will get passed around and wastes memory. A better way to do this is, you assign a variable (say int state = 0) to keep track of the current parsing state. Based on the current state and input, your algorithm will then alter the state. The states always go forward, but you can tell your algorithm to return to some previous state if a certain condition is not matched (known as "backtracking").

E.g. if "ab" and "ac" are two valid inputs, when parsing "ac" the algorithm may go like this:

char is 'a' ==> go to state.checkB
char is not 'b' ==> go back to state.checkA
checkB was already done ==> go to state.checkC
char is 'c' ==> DoSomething();

It takes a load of articles and graphs to completely explain everything, hope this will give you some idea where to look further.

kevin
  • 2,196
  • 1
  • 20
  • 24
  • The problems that I see are a)the input is of unknown size so I don't want to keep it entirely in memory, b) the only actions I am performing on this input is translation, I am re-outputting the text, c) the translations are unknown and defined by the user and d) the number of allowed symbols is enormous thus there is a large state space (one of 26 initial symbols followed by one of 999 terminal symbols) and knowing which I just read doesn't really add any intuition to the solution ... I do not have to verify the input file is well formed just perform the small number of translations. – Stephen May 24 '12 at 00:17
  • I have studied NFAs and DFAs, I did use them to refine the gross logic of the problem. I just think that the state space and the dynamic nature of both the input and the translations makes it impractical. – Stephen May 24 '12 at 00:18
0

Presuming that by 'text parser' you mean that you are trying to condense down words and phrases of the same meaning to simplify reacting to commands.

In which case as per old text adventure programs a simple left right parser using a lookup table of rules would work here.

Unless I misunderstand your problem domain your solution seems awfully over-engineered.

  • I will review left right parsing and a look up table and see if I can make use of it. From what I remember of it in university I think the "if I see this symbol followed by this other symbol (at any point in the future) apply this rule". I am trying to avoid 2 pass parsing using the first pass to identify just these situations. – Stephen May 24 '12 at 00:20
0

Sounds like you should try regular expressions. Here's a link to discussion about choosing a library: C++ RegEx Library Choice. Boost is a popular one.

Also, have you considered using another language to tackle the problem? Python has an excellent base of useful libraries, including one for regular expressions (import re). If it's in your wheelhouse you may find it easier than a C++ solution.

Finally, consider using an "already defined" text format instead of a custom one for the input file. XML is a good choice. It may be easier to nest rules in an XML tree. C++ you can use The Expat XML Parser (Python would be xml.etree.ElementTree).

Community
  • 1
  • 1
brad-tot
  • 783
  • 7
  • 9
  • The standards I'm working with are from the 1970's and I have no flexibility to alter them. I can only use C++ or flat C – Stephen Jun 14 '12 at 20:47