I am trying to parse a Wikipedia page and need to extract a particular section of the page using regex. In the data below, I just need to extract the contents of the {{Infobox...}} section.

{{Infobox XC Championships
|Name       = Senior men's race at the 2008 IAAF World Cross Country Championships
|Host city  = [[Edinburgh]], [[Scotland]], [[United Kingdom]] {{flagicon|United Kingdom}}
|Location   = [[Holyrood Park]]
|Nations participating  = 45
}}
2008.<ref name=iaaf_00>
{{ Citation 
| last = 
| publisher = [[IAAF]]
}}

So in the above example, I need to extract only

Infobox XC Championships
|Name       = Senior men's race at the 2008 IAAF World Cross Country Championships
|Host city  = [[Edinburgh]], [[Scotland]], [[United Kingdom]] {{flagicon|United Kingdom}}
|Location   = [[Holyrood Park]]
|Nations participating  = 45

Please note that there might be nested {{ }} pairs within the {{Infobox...}} section. I don't want to omit those.

Below is my regex:

\\{\\{Infobox[^{}]*\\}\\} 

but it doesn't seem to work. Please help. Thanks!

Ankit
  • Regex is not designed to handle nesting. – Amber Nov 18 '13 at 06:59
  • Thanks @Amber. Can you suggest the best way to do this? – Ankit Nov 18 '13 at 07:00
  • One way would be to iterate over the string, counting opening braces and stopping when you pass an equal number of closing braces. – Amber Nov 18 '13 at 07:04
  • Actually I have a dump of around 45 GB, and I suppose a lot of time will be spent on processing if I use this brute-force method. That is why I was looking for a regex, hoping it would be a bit faster. – Ankit Nov 18 '13 at 07:06
  • If you are worried about performance, regexes will almost never solve your problems. Regexes are an easy, convenient way to solve certain problems, but are generally notorious for poor performance. – femtoRgon Nov 18 '13 at 07:13
  • I think I am okay with it as there is no other feasible way I can think currently. – Ankit Nov 18 '13 at 07:15
  • Note that "this brute force method" is not actually any different from what a regex is doing, in terms of the amount of data processed... – Amber Nov 18 '13 at 07:18
  • Actually I don't know much about regex. I didn't mean to be offensive. I was looking to use regex. Thanks! – Ankit Nov 18 '13 at 07:20
  • See [mediawiki api: how to get infobox from a wikipedia article](http://stackoverflow.com/questions/7638402/mediawiki-api-how-to-get-infobox-from-a-wikipedia-article), and [Getting the infobox section of wikipedia](http://stackoverflow.com/questions/3312346/getting-the-infobox-section-of-wikipedia). And on regexes, strictly speaking, this is impossible, see [Can regular expressions be used to match nested patterns?](http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns). – femtoRgon Nov 18 '13 at 07:22

2 Answers

Because of how the infobox section is formatted, it actually is possible to use a regular expression for this.
The trick is that you don't need to handle the nested {{...}} elements at all, as each of them sits within a line that starts with a |.

{{(Infobox.*\r\n(?:\|.*\r\n)+)}}

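{{           literal opening braces
  (Infobox   start of the capturing group, followed by the literal text Infobox
  .*\r\n     the rest of the line, up to and including the line break
  (?:        start of a non-capturing group
    \|       the line has to start with a |
    .*\r\n   the rest of the line, up to and including the line break
  )          end of the non-capturing group
  +          the non-capturing group can occur one or more times
  )          end of the capturing group
}}           literal closing braces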

So, within the Infobox-section you just match lines beginning with a | until }} pops up.

You may have to experiment with \r\n depending on your platform/language. Debuggex was fine with \r\n, but regex101.com would only match on \n. Writing the line break as \r?\n accepts both.
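
As a rough illustration, here is how the line-based pattern might be applied in Python. This is only a sketch under a few assumptions: the asker's language is not stated, the braces are escaped (some regex engines require that), the line break is written as \r?\n, and the names `INFOBOX_RE` and `extract_infobox_by_lines` are made up for this example.

```python
import re

# Adaptation of the pattern above (an assumption, not the original verbatim):
# braces are escaped and \r?\n accepts both Windows and Unix line endings.
INFOBOX_RE = re.compile(r"\{\{(Infobox.*\r?\n(?:\|.*\r?\n)+)\}\}")


def extract_infobox_by_lines(wikitext):
    """Return the infobox body without the outer braces, or None if absent."""
    match = INFOBOX_RE.search(wikitext)
    if match is None:
        return None
    # The capture ends with the last line break; drop it for a clean result.
    return match.group(1).rstrip("\r\n")


sample = (
    "{{Infobox XC Championships\n"
    "|Name       = Senior men's race\n"
    "|Host city  = [[Edinburgh]] {{flagicon|United Kingdom}}\n"
    "|Nations participating  = 45\n"
    "}}\n"
    "2008.<ref name=iaaf_00>\n"
)
print(extract_infobox_by_lines(sample))
```

Note that this relies on every infobox line starting with a |. If a nested template spans several lines and its closing }} sits at the start of a line, the match would stop there and return a truncated infobox.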

KeyNone
  • I was not aware that a | has to be there at the start of each line in the infobox section. Thanks a lot. It seems to be working :) – Ankit Nov 18 '13 at 16:58
  • @Kailash to be honest, I don't know if a `|` **has** to be there, as I don't know the syntax Wikipedia uses. But in the example you posted it looked like a part of the pattern to me. – KeyNone Nov 18 '13 at 17:00

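Don't use regex. Follow this algorithm:

1. Initialize a counter to 0.

2. Increment the counter when you find {{.

3. Decrement the counter when you find }}.

4. Repeat steps 2 and 3 until the counter is back to 0. The text between the first {{ and that final }} is the section you want.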
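
The steps above could be sketched in Python roughly as follows. The function name `extract_infobox` and the `marker` parameter are hypothetical, and the sketch assumes braces only ever appear as {{ and }} pairs (triple braces such as {{{param}}} would confuse the counter).

```python
def extract_infobox(wikitext, marker="{{Infobox"):
    """Return the text inside the first infobox's outer braces, or None."""
    start = wikitext.find(marker)
    if start == -1:
        return None

    depth = 0  # number of {{ currently open
    i = start
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                # Slice off the outer {{ and }} and surrounding whitespace.
                return wikitext[start + 2:i - 2].strip()
        else:
            i += 1
    return None  # the braces never balanced


page = (
    "{{Infobox XC Championships\n"
    "|Host city  = [[Edinburgh]] {{flagicon|United Kingdom}}\n"
    "|Nations participating  = 45\n"
    "}}\n"
    "2008.<ref name=iaaf_00>\n"
    "{{ Citation \n| last = \n}}\n"
)
print(extract_infobox(page))
```

This makes a single pass over the text and keeps nested templates such as {{flagicon|...}} intact, wherever their braces fall.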

Anirudha