2

I'd like to parse the following:

name:name

where the names start and end with an alnum, and can contain any combination of alnum and spaces inside. They could also be blank. My rules for this are:

identifier = alnum (space* alnum)*;
name       = (identifier | zlen) >sName $pName %fName;

The names can be separated by a colon and optionally spaces inbetween the names and the colon. My rules for this are:

sep = space* ":" space*;
main := name sep name;

This doesn't work because apparently the space* in identifier and the space* in sep confuse the parser. I end up getting the action fName executed in every space of the name.

If I change sep to:

sep = ":";

then everything is fine. How do I modify these rules so that the parser does what I intend?

Source code for this question here: https://gist.github.com/1661150

1 Answers1

10

There are two basic solutions to this kind of problem.

  1. Define the action so it can be executed safely multiple times,
  2. Change the syntax so the action is only executed once.

In this case, I would choose a hybrid approach. Use actions to record the start and ending positions of a name: these actions can be executed safely many times since they just record locations. Once you are sure you are past the name, execute a different action that will only execute once.

/* C code */
char *name_start, *name_end;

/* Ragel code */
action markNameStart { name_start = p; }
action markNameEnd { name_end = p; }
action nameAction {
    /* Clumsy since name is not nul-terminated */
    fputs("Name = ", stdout);
    fwrite(name_start, 1, name_end - name_start, stdout);
    fputc('\n', stdout);
}

name = space* %markNameStart
       (alnum+ %markNameEnd <: space*)+
       %nameAction ;
main := name ":" name ;

Here, the syntax for name includes arbitrary spaces and at least one alphanumeric character. When the first alphanumeric character is encountered, its location is saved in name_start. Whenever run of alphanumeric characters ends, the location of the following character is saved in name_end. The <: is technically unnecessary but it reduces how often the markNameEnd action is executed.

Just be sure not to place such an expression next to any spaces.

I have not tested the above code. You should look at the Graphviz visualization of the state machine before you use.

What Ragel is doing

With your original code, let's suppose the input is as follows:

Hello world : Goodbye world

The Ragel machine scans from left to right, finds the start of a name, and scans over the alphanumeric characters.

Hello world : Goodbye world
    ↑

The next character is a space. So either we have encountered a space inside a word, or the first space after the end of a word. How does Ragel choose?

Ragel chooses both options, at the same time. This is very important. Ragel is trying to simulate a nondeterministic finite automaton, but since your computer is deterministic, the easiest way to do that is to convert the NFA to a DFA which simulates an unlimited number of NFAs in parallel. Since the NFAs have a finite number of states (hence the name), the DFAs also have a finite number of states, so this technique works.

After encountering the space, you have one NFA in the following state, looking for the rest of the name:

identifier = alnum (space* alnum)*;
                    ↑

main := name sep name;
        ↑

The second NFA is in the following state, and it assumes that the name has already ended (and this NFA executes the fName action "prematurely"):

sep = space* ":" space*;
      ↑

main := name sep name;
             ↑

It's obvious to you and it's obvious to me that only the first NFA is correct. But machines created with Ragel only look at one character at a time, they don't look ahead to see which option is correct. The second NFA will eventually encounter an alphanumeric character where it was expecting to see ":", and since that's not allowed, the second NFA will disappear.

A look at the Ragel documentation

Here's the description of %:

expr % action

The leaving action operator queues an action for embedding into the transitions that go out of a machine via a final state.

The action gets executed for transitions that do not necessarily contribute to a successful parse. See the Ragel guide, Chapter 4, "Controlling Nondeterminism" for more information about nondeterminism in Ragel, although the techniques in Chapter 4 won't help you in this particular case since the actions in your machine can only be disambiguated with unbound lookahead, which is not allowed in finite state machines.

Dietrich Epp
  • 205,541
  • 37
  • 345
  • 415
  • My problem is that there is actually only one correct way to parse, for example, "aa bb : cc dd". The first name must be "aa bb" and the second must be "cc dd". There shouldn't be any way it can be interpreted, can it? I've defined name as ending in an alnum and it cannot contain a colon. At this point I'm almost considering it a bug. –  Jan 23 '12 at 07:30
  • @Ana: There is only one way to interpret your code, and Ragel is doing the correct thing. The problem here is that a DFA will exit "successfully" multiple times, invoking the `%` action each time, even though only one of those exits will contribute to the final parse. How much do you know about automaton theory? If you can't articulate the difference between a nondeterministic and deterministic automaton (and why this is relevant to your problem), Ragel might seem a bit arcane. – Dietrich Epp Jan 23 '12 at 08:04