2

As a follow-up to my previous question, "Hundreds of RegEx on one string", I ended up with a regex like the following:

(section1:|section2:|section3:|section[s]?4:|(special section:|it has:|synonyms:)).*?(?=section1:|section2:|section3:|section[s]?4:|(special section:|it has:|synonyms:)|$)


The regex in my production system is more than 1000 characters long and spans multiple lines. All it does is chunk a big piece of text into sections, which are then processed individually to extract information. I also want the section titles to be tolerant of natural language, so some sections can be typed in several ways, which increases the size of the regex. Is there a better way of doing this in terms of performance and manageability?

Sap

3 Answers

4

Use a lexical analyzer instead of regex.
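
As a rough illustration of what that could look like, here is a minimal hand-written scanner in Java. It is only a sketch: the class name `SectionLexer`, the `Section` type, and the header list are hypothetical, and matching headers case-insensitively is an assumption rather than something stated in the question. Like the regex, it ignores any text that comes before the first header.

```java
import java.util.ArrayList;
import java.util.List;

final class SectionLexer {
    // Hypothetical header spellings. If one spelling were a prefix of
    // another, the longer one should be listed first.
    static final String[] HEADERS = {
        "section1:", "section2:", "section3:", "sections4:", "section4:",
        "special section:", "it has:", "synonyms:"
    };

    /** One chunk: the header that opened it and the text up to the next header. */
    static final class Section {
        final String header;
        final String body;

        Section(String header, String body) {
            this.header = header;
            this.body = body;
        }
    }

    /** Returns the canonical header that starts at pos, or null if none does. */
    static String headerAt(String text, int pos) {
        for (String h : HEADERS) {
            // Case-insensitive comparison (an assumption, see above).
            if (text.regionMatches(true, pos, h, 0, h.length())) {
                return h;
            }
        }
        return null;
    }

    /** Single left-to-right pass that splits the text at every known header. */
    static List<Section> lex(String text) {
        List<Section> sections = new ArrayList<>();
        String current = null; // header of the section currently being read
        int bodyStart = 0;     // index where that section's body begins
        int pos = 0;
        while (pos < text.length()) {
            String h = headerAt(text, pos);
            if (h != null) {
                if (current != null) {
                    sections.add(new Section(current, text.substring(bodyStart, pos).trim()));
                }
                current = h;
                pos += h.length();
                bodyStart = pos;
            } else {
                pos++;
            }
        }
        if (current != null) {
            sections.add(new Section(current, text.substring(bodyStart).trim()));
        }
        return sections;
    }
}
```

Calling `SectionLexer.lex(text)` returns the sections in document order. Adding a new spelling means adding one string to `HEADERS` instead of editing two places in a long regex.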

xpda
1

Maybe try a parser generator like one of those discussed in "What's better, ANTLR or JavaCC?"

If you have a natural language grammar then you typically have repeated sub-grammars to allow reordering. A proper grammar for that is going to be much easier to maintain than a regular expression.

Mike Samuel
  • I looked into ANTLR, and it seemed to me that it works well when the text is in a much more structured format. I, on the other hand, am working with natural language, which tends to be unstructured. People often forget to punctuate statements, or there may be more than one way to express the same thing. – Sap Sep 14 '11 at 09:44
1
  1. To improve the performance of such a regexp, you can use prefix optimisation: https://code.google.com/p/graph-expression/wiki/RegexpOptimization

  2. This framework allows you to write type-checked regexps with a Java DSL, so they become refactorable and maintainable: https://code.google.com/p/graph-expression/
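
Independently of that framework, the manageability part can be sketched with plain `java.util.regex`: keep the header spellings in a list and generate the regex from it. The sketch below does not use the graph-expression API; the class name `SectionRegexBuilder` and the header list are hypothetical, and the `DOTALL` and `CASE_INSENSITIVE` flags are assumptions about the input text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class SectionRegexBuilder {
    // Hypothetical header list; it could just as well be loaded from a file.
    static final List<String> HEADERS = Arrays.asList(
        "section1:", "section2:", "section3:", "sections4:", "section4:",
        "special section:", "it has:", "synonyms:");

    /** Builds (header)(body) followed by a lookahead for the next header or end of input. */
    static Pattern build(List<String> headers) {
        String alternation = headers.stream()
            .map(Pattern::quote) // treat every header as a literal string
            .collect(Collectors.joining("|"));
        String regex = "(" + alternation + ")(.*?)(?=" + alternation + "|$)";
        // DOTALL lets a body span several lines; both flags are assumptions.
        return Pattern.compile(regex, Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    }

    /** Returns {header, body} pairs in document order. */
    static List<String[]> chunk(String text) {
        Matcher m = build(HEADERS).matcher(text);
        List<String[]> out = new ArrayList<>();
        while (m.find()) {
            out.add(new String[] { m.group(1), m.group(2).trim() });
        }
        return out;
    }
}
```

In real use the compiled `Pattern` would be built once and reused rather than rebuilt on every call. Shared prefixes could also be factored by hand, for example `section(?:[123]|s?4):`, which is one simple form of prefix optimisation.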

yura
  • The option given is very good, but I am a little confused about how to use it. The regex I posted above selects not only the section headers but also their content. How can I use GExp to achieve that? – Sap Sep 15 '11 at 08:57
  • @Grrrrr Oh... I think I can just use the generated regex twice to do so. – Sap Sep 15 '11 at 08:59
  • OK. The regexp tool just lets you create a regexp from a list of strings (or other regexps), so you can pass it to Pattern.compile and then extract the content from the Matcher. There is an option not to generate captured groups, so you can combine it with other regexps and get a field via Matcher.group(number). GExp is a high-level regexp, meaning you first write a lexer to create tokens and then write a regexp over them; a related tool is GATE JAPE. – yura Sep 15 '11 at 10:50
  • Hey, I am trying to run the example given at https://code.google.com/p/graph-expression/wiki/Examples. Do you know where methods like "match" and "seq" come from? Is that a static import or an inherited class? – Sap Sep 15 '11 at 13:03
  • Yep, there are static imports of GraphUtils. You can find all this code in test sources. – yura Sep 15 '11 at 15:59