Problem: I need to match some content within a large text (a Wikipedia dump consisting of XML pages) in Java.
Content required: Infobox
Regex used: "\\{\\{Infobox(.*?)\\}\\}"
Issue: the pattern above stops at the first occurrence of }} inside the infobox, and if I remove the ? character from the regex, the match runs through to the last occurrence of }} in the text. I want to extract just the infobox, so }} should match the end of the infobox.
Example infobox:
{{infobox RPG
|title= Amber Diceless Roleplaying Game
|image= [[Image:Amber DRPG.jpg|200px]]
|caption= Cover of the main ''Amber DRPG'' rulebook (art by [[Stephen Hickman]])
|designer= [[Erick Wujcik]]
|publisher= [[Phage Press]]<br>[[Guardians of Order]]
|date= 1991
|genre= [[Fantasy]]
|system= Custom (direct comparison of statistics without dice)
|footnotes=
}}
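The example above contains no nested {{...}} template, so it does not trigger the problem by itself. A minimal sketch that reproduces both behaviors, assuming a hypothetical page text in which the infobox contains a nested template and is followed by another template (the class name, the TEXT constant, and the firstMatch helper are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InfoboxRegexDemo {
    // Hypothetical page text: an infobox with a nested template,
    // followed by an unrelated template further down the page.
    static final String TEXT =
        "{{Infobox RPG\n|title= Amber\n|date= {{start date|1991}}\n|footnotes=\n}}\n"
        + "Body text {{cite web|url=x}} end.";

    // Returns the first match of regex in text (DOTALL so '.' spans newlines), or null.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Lazy quantifier: stops at the first "}}", which closes the nested template.
        System.out.println(firstMatch("\\{\\{Infobox(.*?)\\}\\}", TEXT));
        System.out.println("-----");
        // Greedy quantifier: runs through to the last "}}" in the whole text.
        System.out.println(firstMatch("\\{\\{Infobox(.*)\\}\\}", TEXT));
    }
}
```

With this input the lazy match is cut off right after the nested template, and the greedy match swallows the unrelated template after the infobox.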
Code snippet:
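// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
String regex = "\\{\\{Infobox(.*?)\\}\\}";
// DOTALL lets '.' match line terminators, so the match can span several lines.
Pattern p1 = Pattern.compile(regex, Pattern.DOTALL);
Matcher m1 = p1.matcher(xmlPage.getText());
String workgroup = "";
// Note: each iteration overwrites workgroup, so only the last match is kept.
while (m1.find()) {
    workgroup = m1.group();
}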
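
For comparison, one possible direction that avoids relying on a single regex is to find the opening marker and then count {{ and }} pairs until the depth returns to zero. This is only a sketch under assumptions: the class and method names are hypothetical, the opening marker is matched case-insensitively because the example starts with "{{infobox", and triple braces such as {{{param}}} are not treated specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InfoboxExtractor {
    // Matches only the opening marker; case-insensitive so "{{infobox" also counts.
    private static final Pattern START =
        Pattern.compile("\\{\\{infobox", Pattern.CASE_INSENSITIVE);

    /** Returns the first balanced {{Infobox ...}} block in text, or null if none is found. */
    static String extractFirst(String text) {
        Matcher m = START.matcher(text);
        if (!m.find()) {
            return null;
        }
        int start = m.start();
        int depth = 0;
        int i = start;
        while (i < text.length()) {
            if (text.startsWith("{{", i)) {
                depth++;          // entering a template (the infobox itself or a nested one)
                i += 2;
            } else if (text.startsWith("}}", i)) {
                depth--;          // leaving a template
                i += 2;
                if (depth == 0) {
                    return text.substring(start, i); // the infobox's own closing braces
                }
            } else {
                i++;
            }
        }
        return null; // braces never balanced
    }
}
```

A call such as InfoboxExtractor.extractFirst(xmlPage.getText()) would then return the whole infobox including nested templates, or null if no balanced block exists.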