OK, so here's what I need :

  • We have the full XML of a Wikipedia article
  • We need just the Infobox section

I have tried various things, but my main issue seems to be that I can't match "internal" (nested) curly brackets. Any ideas? Or is there a regex you have managed to get this done with?

For those of you who do not know what I'm talking about, here's a (somewhat abridged) example of what I'm trying to parse: http://regexr.com?38299

What is needed is the part from {{Infobox ******* up to its corresponding closing brackets (}}).

Dr.Kameleon
  • I'm not entirely certain you *can* do that with regular expressions, but I recall seeing a lot of regular expressions in the part of MediaWiki that renders pages, so maybe you can. – icktoofay Jan 20 '14 at 07:58
  • @icktoofay It's not like I'm stuck with RegEx's. However, since I'm basically testing a scraping framework of mine (written in PHP, and supporting XPaths and RegExs for pattern extraction) I think this is the way to go here (or at least try it :-)) – Dr.Kameleon Jan 20 '14 at 08:00
  • You may want to at least take a look to see how MediaWiki parses them. You might be able to take the same approach. – icktoofay Jan 20 '14 at 08:04
  • Don't use regex. http://stackoverflow.com/a/21107068/1333493 – Nemo Nov 14 '15 at 18:39

1 Answer

Ok, I got it!

Try this..:

(?=\{Infobox)(\{([^{}]|(?1))*\})

Here's the working example:

http://regex101.com/r/kT1jF4

Bryan Elliott
  • OK. Let me say just that : *WOW!*. Thanks a lot buddy! (I honestly didn't think anyone could get it right... Let me run several tests and you'll get all the credit you deserve! ;-)) – Dr.Kameleon Jan 20 '14 at 09:23
  • Could you please provide regex for other languages also. Your working example is not working for JS, Python and Golang :( – Damjan Pavlica Jan 10 '19 at 09:03
  • @DamjanPavlica Unfortunately, only regex flavors that support recursion are able to perform this type of match. PHP (PCRE), Perl, and Ruby support regular expression recursion; JavaScript, Golang, and Python do not, although there is an optional [regex library](https://pypi.org/project/regex/) for Python that can be installed and that supports regex recursion. Without support for recursion, matching corresponding nested brackets with regex is impossible. – Bryan Elliott Jan 31 '19 at 17:06
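
For flavors without recursion, one workaround is to skip the regex and count braces by hand. The following is only a minimal sketch in Python, not something taken from the answer above: the function name `extract_infobox`, its `marker` parameter, and the sample text are all hypothetical. Like the recursive pattern, it assumes the braces inside the infobox are balanced.

```python
from typing import Optional


def extract_infobox(wikitext: str, marker: str = "{{Infobox") -> Optional[str]:
    """Return the first infobox template, nested templates included.

    Hypothetical helper: it counts single braces, so it relies on the
    braces inside the infobox being balanced.
    """
    start = wikitext.find(marker)
    if start == -1:
        return None  # no infobox in this text

    depth = 0
    for pos in range(start, len(wikitext)):
        ch = wikitext[pos]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # The brace that opened the infobox has just been closed.
                return wikitext[start:pos + 1]
    return None  # ran out of text before the braces balanced


# Made-up sample, not real article markup.
sample = (
    "Intro text {{Infobox person|name=X"
    "|birth={{birth date|1950|1|1}}|x=y}} more text {{cite}}"
)
print(extract_infobox(sample))
```

The same loop ports directly to JavaScript or Go, since it needs nothing beyond a substring search and a counter.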