6

I'm currently trying to create a software component that can interpret dynamic strings such as:

%TO_LOWER%(%DELETE_WHITESPACES%("A SAMPLE TEXT"))

Which would result in this string:

asampletext

I would like to be able to define a set of available functions, with semantic parameters, etc. I already know (more or less) how to do it using regular expressions.

My questions are:

  • Is lexing/parsing much better than regular expressions for such a purpose, or should I just go with regular expressions and forget about it?
  • Does such a library already exist in Java?
  • Do you know of any tutorials showing sample parsing/lexing algorithms?

Thanks!

Nanocom
  • 5
    Yes, ANTLR is the solution. You should not use regex for the heavy lifting of language parsing. There is a very good example on Stack Overflow: http://stackoverflow.com/questions/1931307/antlr-is-there-a-simple-example – ring bearer Sep 16 '12 at 00:40
  • Custom languages built for a specific purpose like this are often called [domain-specific languages](http://en.wikipedia.org/wiki/Domain-specific_language). – Jesse Webb Jun 11 '13 at 16:56

3 Answers

7

Is lexing/parsing way better than regexp for such a purpose, or should I just go with regexp and forget about that?

Regexes cannot express a recursive grammar, and your syntax appears to require one, because function calls can be nested inside each other to any depth. If that is the case, then regexes simply won't solve the problem.
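To illustrate why recursion matters here, below is a minimal hand-written sketch in Java, not a definitive implementation. It assumes a grammar guessed from the single example in the question: an expression is either a double-quoted string or `%NAME%(expression)`. The class name `DynamicStringEvaluator` and the function registry are hypothetical, and escapes inside strings and multi-argument functions are deliberately left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Sketch of a recursive-descent evaluator for the assumed grammar:
//   expression := STRING | '%' NAME '%' '(' expression ')'
public class DynamicStringEvaluator {
    // Hypothetical registry of available functions (single String argument only).
    private static final Map<String, UnaryOperator<String>> FUNCTIONS = new HashMap<>();
    static {
        FUNCTIONS.put("TO_LOWER", s -> s.toLowerCase());
        FUNCTIONS.put("DELETE_WHITESPACES", s -> s.replaceAll("\\s+", ""));
    }

    private final String input;
    private int pos = 0;

    private DynamicStringEvaluator(String input) {
        this.input = input;
    }

    public static String evaluate(String input) {
        DynamicStringEvaluator e = new DynamicStringEvaluator(input);
        String result = e.expression();
        e.skipSpaces();
        if (e.pos != input.length()) {
            throw new IllegalArgumentException("Unexpected input at position " + e.pos);
        }
        return result;
    }

    private String expression() {
        skipSpaces();
        if (peek() == '"') return stringLiteral();
        if (peek() == '%') return call();
        throw new IllegalArgumentException("Expected a string or a function at position " + pos);
    }

    private String call() {
        expect('%');
        int start = pos;
        while (pos < input.length() && input.charAt(pos) != '%') pos++;
        String name = input.substring(start, pos);
        expect('%');
        skipSpaces();
        expect('(');
        String argument = expression(); // the recursive step a plain regex cannot model
        skipSpaces();
        expect(')');
        UnaryOperator<String> function = FUNCTIONS.get(name);
        if (function == null) {
            throw new IllegalArgumentException("Unknown function: " + name);
        }
        return function.apply(argument);
    }

    private String stringLiteral() {
        expect('"');
        int start = pos;
        while (pos < input.length() && input.charAt(pos) != '"') pos++;
        String value = input.substring(start, pos);
        expect('"');
        return value;
    }

    private char peek() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }

    private void expect(char c) {
        if (peek() != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at position " + pos);
        }
        pos++;
    }

    private void skipSpaces() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
    }
}
```

Calling `DynamicStringEvaluator.evaluate` on the string from the question applies `DELETE_WHITESPACES` first and `TO_LOWER` second, because `call()` evaluates its argument before applying the function.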

Does such a library already exist in Java?

This is not a problem that a library would solve. You either need to use a parser generator (such as ANTLR or JavaCC) to generate the lexer and parser, or write them virtually from scratch. The former approach is probably better ... unless you've taken a university-level course that covers this field, or are prepared to do extensive reading.
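For a sense of what the "from scratch" route involves, here is a rough sketch of just the lexer stage in Java. The token types and the class name `DynamicStringLexer` are assumptions made for illustration, based only on the example in the question; a parser generator would produce the equivalent of this from a grammar file.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a hand-written lexer: turns the raw text into a flat list of tokens
// that a parser can then consume.
public class DynamicStringLexer {
    public enum Type { FUNCTION, STRING, LPAREN, RPAREN }

    public static final class Token {
        public final Type type;
        public final String text;

        Token(Type type, String text) {
            this.type = type;
            this.text = text;
        }

        @Override
        public String toString() {
            return type + "(" + text + ")";
        }
    }

    public static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace between tokens is ignored
            } else if (c == '(') {
                tokens.add(new Token(Type.LPAREN, "("));
                i++;
            } else if (c == ')') {
                tokens.add(new Token(Type.RPAREN, ")"));
                i++;
            } else if (c == '%' || c == '"') {
                // %NAME% and "text" are both delimited by a repeated character
                int end = input.indexOf(c, i + 1);
                if (end < 0) {
                    throw new IllegalArgumentException("Unterminated token at position " + i);
                }
                Type type = (c == '%') ? Type.FUNCTION : Type.STRING;
                tokens.add(new Token(type, input.substring(i + 1, end)));
                i = end + 1;
            } else {
                throw new IllegalArgumentException("Unexpected character '" + c + "' at position " + i);
            }
        }
        return tokens;
    }
}
```

Splitting lexing from parsing like this keeps character-level details (quotes, `%` delimiters, whitespace) out of the grammar rules, which is the same division of labour the generated lexers and parsers use.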

Do you know any tutorial showing some sample parsing/lexing algorithms?

Both ANTLR and JavaCC have extensive tutorial material and examples.

Stephen C
0

If you are not tied to Java, you can use another language's PEG parser, or Rebol (it has a parse "dialect" that is PEG-equivalent), or reach way back to Icon, Unicon, or now even Object Icon at code.google.com/p/objecticon

It was a sorry moment when I realized that the MIT Curl web content language (www.curl.com) had opted for regexp for users even though Curl has macros and offers access to an AST.

General topic: Parsing Expression Grammars (PEG) and packrat parsing.

Perl's popularity has bequeathed us PCRE, so what can we do but avoid it when it is not needed (there are also ANTLR and Bison, and no doubt they too have their place where they fit easily).

Note: Rebol, Icon and Curl are expression-based languages (Icon has limited backtracking).

Other out-of-the-way options include Oz and Mercury (the latter can output Erlang).

I am not using pyPEG because I am confined to Python 2.6.6; the Python parser Lepl is no longer supported, but it will install on 2.6.

Parsing options in Python include YAPPS at http://theory.stanford.edu/~amitp/yapps/ and various others. Note: pyparsing fails to install in some Python environments.

For Scala/Java there is this PEG project: https://github.com/sirthias/parboiled/wiki

You may find a Java equivalent to peg and leg; see http://piumarta.com/software/peg/

CiteSeer has the Ralph Becket article on packrat parsing and Mercury (search for PEG parse mercury site:psu.edu).

There is also a series of three blog posts on the AdventuresInMercury blog.

0

You could try using Scala on the JVM. It makes it very easy to create DSLs.

Jesse Webb