Regular expression to extract a section from the wikipedia page

Question

I am trying to parse a Wikipedia page and need to extract a particular section of it using a regex. In the data below, I only need the content of the {{Infobox...}} section.

{{Infobox XC Championships
|Name       = Senior men's race at the 2008 IAAF World Cross Country Championships
|Host city  = [[Edinburgh]], [[Scotland]], [[United Kingdom]] {{flagicon|United Kingdom}}
|Location   = [[Holyrood Park]]
|Nations participating  = 45
}}
2008.&lt;ref name=iaaf_00&gt;
{{ Citation 
| last = 
| publisher = [[IAAF]]
}}

So in the above example, I need to extract only

Infobox XC Championships
|Name       = Senior men's race at the 2008 IAAF World Cross Country Championships
|Host city  = [[Edinburgh]], [[Scotland]], [[United Kingdom]] {{flagicon|United Kingdom}}
|Location   = [[Holyrood Park]]
|Nations participating  = 45

Please note that there might be nested {{ }} templates within the {{Infobox...}} section, and I don't want to omit them.

Below is my regex:

\\{\\{Infobox[^{}]*\\}\\}

but it doesn't seem to work. Please help. Thanks!

One way would be to iterate over the string, counting opening braces and stopping when you pass an equal number of closing braces. — Amber, Nov 18 '13 at 07:04
Actually I have around 45GB dump and I suppose a lot of time will be spent in processing if I will proceed through this brute force method. This is the reason I was looking for regex so that it can be a bit faster. — Ankit, Nov 18 '13 at 07:06
If you are worried about performance, regexes will almost never solve your problems. Regexes are an easy, convenient way to solve certain problems, but are generally notorious for poor performance. — femtoRgon, Nov 18 '13 at 07:13
I think I am okay with it as there is no other feasible way I can think currently. — Ankit, Nov 18 '13 at 07:15
Note that "this brute force method" is not actually any different from what a regex is doing, in terms of the amount of data processed... — Amber, Nov 18 '13 at 07:18
Actually i don't know much about regex. I didn't mean to be offensive. I was looking to use regex. Thanks! — Ankit, Nov 18 '13 at 07:20
See [mediawiki api: how to get infobox from a wikipedia article](http://stackoverflow.com/questions/7638402/mediawiki-api-how-to-get-infobox-from-a-wikipedia-article), and [Getting the infobox section of wikipedia](http://stackoverflow.com/questions/3312346/getting-the-infobox-section-of-wikipedia). And on regexes, strictly speaking, this is impossible, see [Can regular expressions be used to match nested patterns?](http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns). — femtoRgon, Nov 18 '13 at 07:22

KeyNone · Accepted Answer · 2013-11-18T08:53:59.163

Because of the way the infobox section is formatted, it actually is possible to use a regular expression here.
The trick is that you don't need to handle the nested {{...}} elements at all, because each of them sits on its own line starting with a |.

{{(Infobox.*\r\n(?:\|.*\r\n)+)}}

{{           literal opening braces
  (Infobox   start of the capturing group, followed by the literal text Infobox
  .*\r\n     any characters until a line break appears
  (?:        start of a non-capturing group
    \|       line has to start with a |
    .*\r\n   any characters until a line break appears
  )          end of the non-capturing group
  +          the non-capturing group can occur one or more times
  )          end of capturing group
}}           literal closing braces

So, within the Infobox-section you just match lines beginning with a | until }} pops up.

You may have to experiment with \r\n depending on your platform/language. Debuggex was fine with \r\n, but regex101.com would only match on \n.
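
As a minimal sketch of how this pattern might be applied, here is a Python version. The language is an assumption, since the question does not name one, and the function name `extract_infobox` and the sample text are hypothetical. Writing the line break as `\r?\n` lets the same pattern match both Windows and Unix line endings. It still relies on every line inside the infobox starting with a |, so a nested template that spans several lines would not match.

```python
import re

# Same idea as the pattern above, with \r?\n so that both "\r\n" and "\n" match.
INFOBOX_RE = re.compile(r"\{\{(Infobox.*\r?\n(?:\|.*\r?\n)+)\}\}")


def extract_infobox(wikitext):
    """Return the infobox body without the outer braces, or None if absent."""
    match = INFOBOX_RE.search(wikitext)
    return match.group(1).rstrip() if match else None


# Hypothetical sample, shortened from the data in the question.
sample = (
    "{{Infobox XC Championships\n"
    "|Name       = Senior men's race\n"
    "|Host city  = [[Edinburgh]] {{flagicon|United Kingdom}}\n"
    "|Nations participating  = 45\n"
    "}}\n"
    "2008.\n"
    "{{ Citation\n"
    "| last =\n"
    "}}\n"
)

print(extract_infobox(sample))
```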

I was not aware that a | has to be there at the start of each line in the infobox section. Thanks a lot. It seems to be working :) — Ankit, Nov 18 '13 at 16:58
@Kailash to be honest, I don't know if a `|` **has** to be there, as I don't know the syntax wikipedia uses. But in the example you posted it looked like a part of the pattern to me. — KeyNone, Nov 18 '13 at 17:00

score 0 · Answer 2 · answered Nov 18 '13 at 08:12

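Don't use a regex. Follow this algorithm instead:

1. Initialize a counter to 0.

2. Increment the counter when you find {{.

3. Decrement the counter when you find }}.

4. Repeat steps 2 and 3 until the counter is back to 0; the text scanned up to that point is the complete section.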
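
The counting steps above could be sketched in Python roughly as follows. The language, the function name `extract_template`, and the sample text are assumptions made for illustration. The scan starts at the first `{{Infobox` and makes a single pass over the text, so nested templates are kept whether or not they sit on their own line.

```python
def extract_template(wikitext, name="Infobox"):
    """Return the text inside the first {{name ...}} block, nested braces included.

    Returns None if no such block exists or its braces are unbalanced.
    """
    start = wikitext.find("{{" + name)
    if start == -1:
        return None

    depth = 0
    i = start
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                # Drop the outer "{{" and "}}" and surrounding whitespace.
                return wikitext[start + 2:i - 2].strip()
        else:
            i += 1
    return None


# Hypothetical sample, shortened from the data in the question.
sample = (
    "{{Infobox XC Championships\n"
    "|Host city  = [[Edinburgh]] {{flagicon|United Kingdom}}\n"
    "|Location   = [[Holyrood Park]]\n"
    "}}\n"
    "{{ Citation\n"
    "| last =\n"
    "}}\n"
)

print(extract_template(sample))
```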

Anirudha
