3

Given a wikiText string such as:

{{ValueDescription
    |key=highway
    |value=secondary
    |image=Image:Meyenburg-L134.jpg
    |description=A highway linking large towns.
    |onNode=no
    |onWay=yes
    |onArea=no
    |combination=
    * {{Tag|name}}
    * {{Tag|ref}}
    |implies=
    * {{Tag|motorcar||yes}}
    }}

I'd like to parse the ValueDescription and Tag templates in Java/Groovy. I tried the regex /\{\{\s*Tag(.+)\}\}/ and it works (it returns |name, |ref and |motorcar||yes), but /\{\{\s*ValueDescription(.+)\}\}/ doesn't match (it should return all the text above).

The expected output

Is there a way to skip nested templates in the regex?

Ideally I would rather use a simple wikiText-to-XML tool, but I couldn't find anything like that.

Thanks! Mulone

Mulone

2 Answers

4

Arbitrarily nested tags can't be matched with a regular expression, since nesting makes the grammar non-regular. You need something capable of dealing with a context-free grammar. ANTLR is a fine option.
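If a full ANTLR grammar feels heavy for this, the nesting can also be tracked by hand with a depth counter. The following is only a minimal sketch of that idea, not a complete wikitext parser: the class TemplateScanner and its methods extractTemplates and findClose are hypothetical names, and it assumes templates are delimited only by {{ and }} (triple-brace parameters, nowiki sections and comments are not handled).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: extracts whole templates by counting {{ and }} pairs.
class TemplateScanner {

    /** Returns the full text of each template called `name`, skipping matches nested inside an already returned match. */
    static List<String> extractTemplates(String text, String name) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < text.length() - 1) {
            if (!text.startsWith("{{", i)) {
                i++;
                continue;
            }
            int end = findClose(text, i);
            if (end < 0) {
                break; // unbalanced braces: stop scanning
            }
            // The template name is everything before the first '|'.
            String body = text.substring(i + 2, end - 2);
            String templateName = body.split("\\|", 2)[0].trim();
            if (templateName.equals(name)) {
                found.add(text.substring(i, end));
                i = end;   // continue after this template
            } else {
                i += 2;    // step inside and look at nested templates
            }
        }
        return found;
    }

    /** Returns the index just past the }} that balances the {{ at `open`, or -1 if there is none. */
    static int findClose(String text, int open) {
        int depth = 0;
        int i = open;
        while (i < text.length() - 1) {
            if (text.startsWith("{{", i)) {
                depth++;
                i += 2;
            } else if (text.startsWith("}}", i)) {
                depth--;
                i += 2;
                if (depth == 0) {
                    return i;
                }
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String wiki = "{{ValueDescription\n|key=highway\n|combination=\n* {{Tag|name}}\n* {{Tag|ref}}\n}}\n"
                + "{{ValueDescription|key=highway|value=primary}}";
        System.out.println(extractTemplates(wiki, "ValueDescription").size());
        System.out.println(extractTemplates(wiki, "Tag"));
    }
}
```

Because the closing braces are found by counting rather than by pattern, this does not depend on where }} sits on the line, and a second ValueDescription block is returned as a separate entry.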

Community
Mike Samuel
2

Create your regex pattern using the Pattern.DOTALL option, so that . also matches line breaks, like this:

Pattern p = Pattern.compile("\\{\\{\\s*ValueDescription(.+)\\}\\}", Pattern.DOTALL);

Sample Code:

// str holds the wikiText to search
Pattern p = Pattern.compile("\\{\\{\\s*ValueDescription(.+)\\}\\}", Pattern.DOTALL);
Matcher m = p.matcher(str);
while (m.find()) {
    System.out.println("Matched: [" + m.group(1) + ']');
}

OUTPUT

Matched: [
|key=highway
|value=secondary
|image=Image:Meyenburg-L134.jpg
|description=A highway linking large towns.
|onNode=no
|onWay=yes
|onArea=no
|combination=
* {{Tag|name}}
* {{Tag|ref}}
|implies=
* {{Tag|motorcar||yes}}
]

Update

Assuming the closing }} of each {{ValueDescription block appears on a line of its own, the following pattern will capture multiple ValueDescription blocks:

Pattern p = Pattern.compile("\\{\\{\\s*ValueDescription(.+?)\n\\}\\}", Pattern.DOTALL);
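As a rough illustration of how the lazy pattern behaves on input with more than one block, here is a small self-contained sketch. It is a variation rather than the exact pattern above: it additionally allows spaces or tabs before the closing }} (an assumption made here because the sample in the question indents that line), and the class name ValueDescriptionRegexDemo and the method bodies are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; still relies on the closing }} starting its own line.
class ValueDescriptionRegexDemo {

    // Lazy (.+?) stops at the first line that begins with optional indentation followed by }}.
    private static final Pattern BLOCK = Pattern.compile(
            "\\{\\{\\s*ValueDescription(.+?)\\n[ \\t]*\\}\\}", Pattern.DOTALL);

    /** Returns the captured body of every ValueDescription block found in wikiText. */
    static List<String> bodies(String wikiText) {
        List<String> out = new ArrayList<>();
        Matcher m = BLOCK.matcher(wikiText);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String wiki = "{{ValueDescription\n    |key=highway\n    |value=secondary\n"
                + "    |combination=\n    * {{Tag|name}}\n    }}\n"
                + "{{ValueDescription\n    |key=highway\n    |value=primary\n    }}\n";
        for (String body : bodies(wiki)) {
            System.out.println("Matched: [" + body + "]");
        }
    }
}
```

The nested {{Tag|...}} entries do not end the match early only because their closing braces never start a line; a block written on a single line would not be matched at all.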
anubhava
  • this works but if there's another '''{{ValueDescription''' block it won't stop. – Mulone Jun 03 '11 at 15:12
  • @Mulone: Assuming closing `}}` appears on a separate line for `{{ValueDescription` following pattern will work to capture multiple `ValueDescription`: `Pattern p = Pattern.compile("\\{\\{\\s*ValueDescription(.+?)\n\\}\\}", Pattern.DOTALL);` – anubhava Jun 03 '11 at 15:29
  • I don't think that that assumption is valid when reading wikitext. Is there a way to make it robust? – Mulone Jun 03 '11 at 16:02
  • @Mulone: Regular expressions do have limitations here, you need to have some type of pattern to match. Closing `}}` must be either on a separate line or be followed by some other character that we can use in the pattern above. For validating/matching a non-regular text you will eventually need a parser utility or will need to write your own parser. – anubhava Jun 03 '11 at 16:09