
Possible Duplicate:
Matching Nested Structures With Regular Expressions in Python

I am trying to match a single group of data from a wiki page. The bit of Python code I'm using is listed below. The issue is that the match runs past the end of its own group to the last }} in the page.

import re

def findPersonInfo(self):
    if self.isPerson:
        # Greedy (.*) combined with DOTALL runs to the last }} on the page.
        regex = re.compile(r"{{persondata(.*)}}", re.IGNORECASE | re.UNICODE | re.DOTALL)
        result = regex.search(self._rawPage)
        if result:
            print('Match found:', result.group())

A sample of the wiki page content:

*[http://www.jsc.nasa.gov/Bios/htmlbios/acaba-jm.html NASA biography]

{{NASA Astronaut Group 19}}

{{Persondata
|NAME= Acaba, Joseph Michael "Joe"
|ALTERNATIVE NAMES=
|SHORT DESCRIPTION=[[Hydrogeologist]]
|DATE OF BIRTH={{Birth date and age|1967|5|17}}
|PLACE OF BIRTH=[[Inglewood, California]]
|DATE OF DEATH=
|PLACE OF DEATH=
}}
{{DEFAULTSORT:Acaba, Joseph M.}}
[[Category:1967 births]]

My current regex is returning the following string:

{{Persondata
|NAME= Acaba, Joseph Michael "Joe"
|ALTERNATIVE NAMES=
|SHORT DESCRIPTION=[[Hydrogeologist]]
|DATE OF BIRTH={{Birth date and age|1967|5|17}}
|PLACE OF BIRTH=[[Inglewood, California]]
|DATE OF DEATH=
|PLACE OF DEATH=
}}
{{DEFAULTSORT:Acaba, Joseph M.}}

I would like it to return:

{{Persondata
|NAME= Acaba, Joseph Michael "Joe"
|ALTERNATIVE NAMES=
|SHORT DESCRIPTION=[[Hydrogeologist]]
|DATE OF BIRTH={{Birth date and age|1967|5|17}}
|PLACE OF BIRTH=[[Inglewood, California]]
|DATE OF DEATH=
|PLACE OF DEATH=
}}

The tricky bit is that it needs to count the other {{ opens and }} closes to know which }} to stop at, but I'm not sure how to get a regex to do that.

Justin808
  • You should use a proper [Wiki parser](http://www.mediawiki.org/wiki/Alternative_parsers)... This is not a regular language BTW – JBernardo Sep 20 '12 at 22:23
  • @JBernardo - I don't need a full parser, I just need one section out of the full page that I can split into key/value pairs. – Justin808 Sep 20 '12 at 22:28
  • @Justin808 that's why you need a parser. Regex doesn't work with languages that allow arbitrary levels of nesting. You *can* make it work for some specific cases, but you shouldn't – JBernardo Sep 20 '12 at 22:30
  • @JBernardo - I suppose, but I would rather not have to use a full external wiki parser. I was hoping to use a regex result with a few .split() calls to get what I needed. I'm trying to use what comes pre-installed on the system for this. – Justin808 Sep 20 '12 at 22:33
  • with nested {{ you should really really be using a parser/grammar .... – Joran Beasley Sep 20 '12 at 22:38
  • @Justin808 don't be afraid to use other people's code when it works better than yours. Just prefer packages that can be installed with `pip` and add their names to the [requirements of your `setup.py` file](http://blog.doughellmann.com/2007/11/requiring-packages-with-distutils.html). – JBernardo Sep 20 '12 at 22:47
  • @JBernardo - it's not fear, it's that this was supposed to be a simple scraper to grab some information. It's turning into a bigger deal than I hoped it would be. I don't want to install other Python libs on my computer that I'll never use again. I'll just hunt for an existing function I can drop into my code or write one myself. It's just odd that Python's regex doesn't support recursive patterns when implementations in other languages do. I understand the cost of implementing the feature, and I know purists say "don't use regex, use a full parser", but that's more than I need for this, in my opinion. – Justin808 Sep 20 '12 at 22:54
  • @Justin808 "odd that python regex doesn't support recursive regex when other implementations in other languages do" What other implementations in other languages are you talking about? I'm not being argumentative. I would like to take a look at them. Thanks. – alan Sep 20 '12 at 23:00
  • @alan - [C# balancing groups](http://msdn.microsoft.com/en-us/library/bs2twtah.aspx), [Perl](http://www.perl.com/pub/2003/06/06/regexps.html) – Justin808 Sep 20 '12 at 23:34
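
For the key/value split mentioned in the comments above, here is a minimal sketch. It assumes the Persondata block has already been isolated and that every parameter sits on its own line starting with `|`, as in the sample page. The function name `parse_params` and the shortened sample block are illustrative only, not part of the original code.

```python
def parse_params(block):
    """Split an isolated template block into a dict of parameter -> value.

    Assumption: one parameter per line, each line starting with '|'.
    """
    params = {}
    for line in block.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skips the opening {{Persondata and closing }} lines
        # Split on the first '=' only, so values may contain '=' or '|'.
        key, _, value = line[1:].partition("=")
        params[key.strip()] = value.strip()
    return params


block = """{{Persondata
|NAME= Acaba, Joseph Michael "Joe"
|SHORT DESCRIPTION=[[Hydrogeologist]]
|DATE OF BIRTH={{Birth date and age|1967|5|17}}
|DATE OF DEATH=
}}"""

info = parse_params(block)
print(info["NAME"])
print(info["DATE OF BIRTH"])
```

Splitting per line rather than on every `|` keeps the nested `{{Birth date and age|1967|5|17}}` value intact, but it would break on a template that puts several parameters on one line.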

2 Answers


{{persondata(.*)}} will match greedily, i.e. it will try to return the longest match possible. You should use {{persondata(.*?)}} if you want to get the shortest possible match. (This is usually called non-greedy or lazy matching.)
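
To make the difference concrete, here is a small sketch on a made-up string (the `text` value is illustrative, not taken from the wiki page). The braces are escaped for clarity.

```python
import re

# Hypothetical input with two separate, non-nested templates.
text = "{{a|x}} middle {{b|y}}"

# Greedy: .* grabs as much as possible, then backtracks to the LAST }}.
greedy = re.search(r"\{\{(.*)\}\}", text).group()

# Non-greedy: .*? grabs as little as possible, stopping at the FIRST }}.
lazy = re.search(r"\{\{(.*?)\}\}", text).group()

print(greedy)  # {{a|x}} middle {{b|y}}
print(lazy)    # {{a|x}}
```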

However, in this case, you have another }} inside your string, so the shortest match stops too early. You can do something clever like {{persondata((?:.*?)}}(?:.*?))}}, which skips over exactly one inner }}, but in general, as soon as you reach recursive structures (structures that nest themselves) you should abandon regular expressions and turn to proper parsing solutions.

You might want to look at pyparsing.
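
If an external library is not an option, one way to do the counting by hand is sketched below using only the standard library. This is not a full wiki parser: it assumes templates are delimited only by `{{` and `}}` and that the page contains no triple-brace `{{{...}}}` arguments. The function name `extract_template` and the shortened sample page are illustrative.

```python
import re


def extract_template(page, name):
    """Return the balanced {{name ...}} block from page, or None.

    Assumption: only '{{' and '}}' delimit templates (no '{{{' arguments).
    """
    opening = re.search(r"\{\{\s*" + re.escape(name), page, re.IGNORECASE)
    if opening is None:
        return None
    start = opening.start()
    depth = 0
    i = start
    while i < len(page) - 1:
        pair = page[i:i + 2]
        if pair == "{{":
            depth += 1          # entering a (possibly nested) template
            i += 2
        elif pair == "}}":
            depth -= 1          # leaving a template
            i += 2
            if depth == 0:      # closed the template we started in
                return page[start:i]
        else:
            i += 1
    return None                 # braces never balanced


page = (
    "{{NASA Astronaut Group 19}}\n\n"
    "{{Persondata\n"
    '|NAME= Acaba, Joseph Michael "Joe"\n'
    "|DATE OF BIRTH={{Birth date and age|1967|5|17}}\n"
    "|DATE OF DEATH=\n"
    "}}\n"
    "{{DEFAULTSORT:Acaba, Joseph M.}}\n"
)

block = extract_template(page, "persondata")
print(block)
```

Unlike the "clever" regex, the depth counter does not care how many nested templates appear inside the block, so an added death-date template would not change the result.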

Hans Then
  • or `r"{{persondata([^}]*)}}"` assuming there are no nested curly braces ... – Joran Beasley Sep 20 '12 at 22:22
  • But that would give an equally-wrong result, terminating after `|DATE OF BIRTH={{Birth date and age|1967|5|17}}`. – ruakh Sep 20 '12 at 22:23
  • So the only way to do this in standard Python is to not use regex and write my own function to deal with recursive braces? Seems odd, every other language I've used seems to support recursion without an external library. – Justin808 Sep 20 '12 at 22:32
  • There is a difference between "supporting recursion" and "parsing recursive structures using regular expressions". As they say in compiler class, regular expressions cannot count (i.e. they do not maintain a stack to track arbitrarily deep nesting). It is of course possible to write a nice recursive descent parser using only pure Python. – Hans Then Sep 20 '12 at 22:37
  • nested stuff is bad ... a regex by definition has only finite state and therefore cannot match arbitrary nesting ... but you can match this .. but it will break on anything else... – Joran Beasley Sep 20 '12 at 22:39
  • Although depending on your actual requirements, you may just accept the "clever" regular expression and parse your data, as long as you know what you are doing. – Hans Then Sep 20 '12 at 22:39
  • I'll accept as there is no answer and this explains why. The "clever" expression will fail if the person has a death date. – Justin808 Sep 20 '12 at 22:48

There's a module on PyPI that was created for this purpose. See mwparserfromhell.

riamse