UPDATE 2

Original question: Can I avoid using Ragel's scanner construct (|* ... *|) if I don't need backtracking?

Updated answer: Yes. If you don't need backtracking, you can write a simple tokenizer with the ordinary Kleene star, ( ... )*, instead of a scanner.

UPDATE 1

I realized that asking about XML tokenizing was a red herring, because what I'm doing is not specific to XML.

END UPDATES

I have a Ragel scanner/tokenizer that simply looks for FooBarEntity elements in files like:

<ABC >
  <XYZ >
    <FooBarEntity>
      <Example >Hello world</Example >
    </FooBarEntity>
  </XYZ >
  <XYZ >
    <FooBarEntity>
      <Example >sdrastvui</Example >
    </FooBarEntity>
  </XYZ >
</ABC >

The scanner version:

%%{
  machine simple_scanner;
  action Emit {
    emit data[(ts+14)...(te-15)].pack('c*')
  }
  foo = '<FooBarEntity>' any+ :>> '</FooBarEntity>';
  main := |*
    foo => Emit;
    any;
  *|;
}%%

The non-scanner version (i.e. it uses ( ... )* instead of |* ... *|):

%%{
  machine simple_tokenizer;
  action MyTs {
    my_ts = p
  }
  action MyTe {
    my_te = p
  }
  action Emit {
    emit data[my_ts...my_te].pack('c*')
    my_ts = nil
    my_te = nil
  }
  foo = '<FooBarEntity>' any+ >MyTs :>> '</FooBarEntity>' >MyTe %Emit;
  main := ( foo | any+ )*;
}%%

I figured this out and wrote tests for it at https://github.com/seamusabshere/ruby_ragel_examples

You can see the reading/buffering code at https://github.com/seamusabshere/ruby_ragel_examples/blob/master/lib/simple_scanner.rl and https://github.com/seamusabshere/ruby_ragel_examples/blob/master/lib/simple_tokenizer.rl

Seamus Abshere

2 Answers

You don't have to use a scanner to parse XML. I've implemented a simple XML parser in Ragel, without a scanner. Here is a blog post with some timings and more info.

Edit: You can do it many ways. You could use a scanner. You could parse for words and if you see STARTANIMAL you start collecting words until you see STOPANIMAL.
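
For what it's worth, here is a rough plain-Ruby sketch of the "collect words between STARTANIMAL and STOPANIMAL" idea. It is not Ragel and not the code from the parser mentioned above; the method name collect_animals and the whitespace-based word splitting are assumptions made purely for illustration.

```ruby
# Hypothetical helper (not from the answer): returns the words found
# between each STARTANIMAL / STOPANIMAL pair, joined by single spaces.
def collect_animals(text)
  groups  = []
  current = nil # nil means "not currently inside a STARTANIMAL block"

  text.split.each do |word|
    if word == 'STARTANIMAL'
      current = []                 # start collecting
    elsif word == 'STOPANIMAL'
      groups << current.join(' ') if current
      current = nil                # stop collecting
    elsif current
      current << word              # a word inside the block
    end
  end

  groups
end

p collect_animals('x STARTANIMAL brown cow STOPANIMAL y STARTANIMAL dog STOPANIMAL')
```

Words outside a block are simply dropped, which plays the same role as the catch-all any alternative in the question's machines.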

Sébastien Le Callonnec
NateS
  • is it too much to ask for an example? I would buy you a coffee and ask you to write it on a napkin, but I doubt you're in Madison, WI :) – Seamus Abshere Jun 09 '11 at 14:20
  • http://code.google.com/p/libgdx/source/browse/trunk/gdx/src/com/badlogic/gdx/utils/Xml.rl this link is broken, sorry. – h4ck3rm1k3 Apr 16 '12 at 10:24
  • Updated link: https://code.google.com/p/libgdx/source/browse/trunk/gdx/src/com/badlogic/gdx/utils/XmlReader.rl – NateS Apr 26 '12 at 16:02
  • Updated updated link: https://github.com/libgdx/libgdx/blob/master/gdx/src/com/badlogic/gdx/utils/XmlReader.rl – NateS Oct 31 '12 at 20:43

Rephrasing Occam: you do not need the scanner unless you need it. Without a scanner you can process one symbol at a time, possibly reading each symbol straight from the stream with no input buffer.
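
To make "one symbol at a time" concrete, here is a hand-written Ruby sketch (not Ragel output) that reads characters from any IO and yields the text between <FooBarEntity> and </FooBarEntity>. The names advance and each_entity are made up for this example. The restart logic in advance is only valid because '<' appears solely as the first character of both tags; arbitrary tags would need a proper fallback table. Unlike the question's any+, this sketch also yields empty entities.

```ruby
require 'stringio'

OPEN_TAG  = '<FooBarEntity>'
CLOSE_TAG = '</FooBarEntity>'

# Advance the match position within `tag` by one input character.
# Assumption: tag[0] ('<') occurs nowhere else in the tag, so on a
# mismatch we can only restart at position 1 (ch is '<') or 0.
def advance(tag, pos, ch)
  if ch == tag[pos]
    pos + 1
  elsif ch == tag[0]
    1
  else
    0
  end
end

# Hypothetical helper: yields the content of each FooBarEntity element.
# Only the current entity's content is held in memory, never the whole input.
def each_entity(io)
  inside  = false
  pos     = 0
  content = +''

  io.each_char do |ch|
    if inside
      content << ch
      pos = advance(CLOSE_TAG, pos, ch)
      if pos == CLOSE_TAG.size
        # The closing tag itself was appended to content; cut it off.
        yield content[0...-CLOSE_TAG.size]
        inside  = false
        pos     = 0
        content = +''
      end
    else
      pos = advance(OPEN_TAG, pos, ch)
      if pos == OPEN_TAG.size
        inside = true
        pos    = 0
      end
    end
  end
end

found = []
each_entity(StringIO.new('<XYZ ><FooBarEntity>Hello world</FooBarEntity></XYZ >')) { |e| found << e }
p found
```

Because each_entity only calls each_char, the StringIO here could be swapped for a File or a socket without changing anything else.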

Peter