-2

[EDIT]:

I understand that regular expressions are not made to parse XML, but my question is about WHY THAT REGULAR EXPRESSION DOESN'T COMPILE IN PYTHON.

I'm expecting answers about WHAT DOESN'T WORK in that regular expression, NOT WHY IT IS NOT A GOOD IDEA TO USE IT (and I don't understand the downvotes).

[/EDIT]

I'm trying to write a function that escapes the content of an XML tag according to this document, and I think the best solution is to escape every "<" and "&" that is not in a CDATA section.

I have a basic knowledge of regular expressions, so I looked around and I found this page and this one.

So apparently the regular expression that works with the "&" is:

&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\d+);)(?!(?>(?:(?!<!\[CDATA\[|\]\]>).)*)\]\]>)

but it doesn't work in Python. In fact, if I try to use it I get:

In [1]: import re

In [2]: x = re.compile('&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\d+);)(?!(?>(?:(?!<!\[CDATA\[|\]\]>).)*)\]\]>)')
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-2-2884ec1d2f4e> in <module>()
----> 1 x = re.compile('&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\d+);)(?!(?>(?:(?!<!\[CDATA\[|\]\]>).)*)\]\]>)')

/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.pyc in compile(pattern, flags)
    188 def compile(pattern, flags=0):
    189     "Compile a regular expression pattern, returning a pattern object."
--> 190     return _compile(pattern, flags)
    191 
    192 def purge():

/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.pyc in _compile(*key)
    243         p = sre_compile.compile(pattern, flags)
    244     except error, v:
--> 245         raise error, v # invalid expression
    246     if len(_cache) >= _MAXCACHE:
    247         _cache.clear()

error: unexpected end of pattern

and this makes me think that the regular expression was not written for Python.

Any help?

Giovanni Di Milia

2 Answers

4

Your regular expression fails to compile because the (?> ...) syntax for atomic grouping (aka "independent subexpressions") is not supported by Python's re module. There is an experimental reimplementation of re available on PyPI that does support atomic groups and other nice features, so you could try that instead.
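As a rough illustration of working around the missing atomic group, the sketch below shows two variants that compile with the standard re module. The first simply swaps (?>...) for a plain non-capturing group (?:...); in this particular pattern the atomic group looks like an optimization only, since backtracking into the run can never land on a position where "]]>" starts. The second emulates atomic behaviour with the common lookahead-plus-backreference idiom (?=(...))\1. The names PLAIN, ATOMIC_EMULATED, escape_ampersands and sample are made up for this example and are not from the question. As far as I know, newer Python releases (3.11 and later) accept (?>...) in re directly, so treat this as a sketch for older interpreters.

```python
import re

# Skip "&" that already starts a character reference such as "&amp;" or "&#38;".
_ENTITY = r'(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\d+);)'
# Run of characters up to (not including) the next "<![CDATA[" or "]]>".
_RUN = r'(?:(?!<!\[CDATA\[|\]\]>).)*'

# Variant 1: a plain non-capturing group in place of the atomic group.
PLAIN = re.compile(r'&' + _ENTITY + r'(?!(?:' + _RUN + r')\]\]>)')

# Variant 2: emulate the atomic group. The lookahead captures the run once,
# and the backreference \1 consumes it without allowing backtracking into it.
ATOMIC_EMULATED = re.compile(r'&' + _ENTITY + r'(?!(?=(' + _RUN + r'))\1\]\]>)')


def escape_ampersands(text, pattern=PLAIN):
    """Replace bare "&" outside CDATA sections with "&amp;" (hypothetical helper)."""
    return pattern.sub('&amp;', text)


sample = 'a & b &amp; c <![CDATA[x & y]]> d &#38; e'
print(escape_ampersands(sample))
print(escape_ampersands(sample, ATOMIC_EMULATED))
```

Both variants leave the "&" inside the CDATA section and the existing references untouched. Note that "." does not match newlines by default, so a CDATA section spanning several lines would need re.DOTALL, which the original pattern does not address either.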

user4815162342
2

XML is not a regular language. Thus, you cannot parse it correctly with regular expressions.

Use and customize an XML parser, such as BeautifulSoup, instead.

For a more comprehensive answer, see the related question "RegEx match open tags except XHTML self-contained tags".

jsalonen
  • I agree, but my question is more about the regular expression itself and not the XML – Giovanni Di Milia Nov 01 '12 at 14:30
  • Plus, I'm not trying to parse valid XML (otherwise I would have used libxml2), only strings that resemble XML but are not always valid XML. – Giovanni Di Milia Nov 01 '12 at 14:32
  • Regular expressions, as widely implemented today, are not limited to regular languages. The OP's regexp uses an independent subexpression feature not supported by Python. – user4815162342 Nov 01 '12 at 14:34
  • Have you tried BeautifulSoup? I think you could first run it to make the best sense out of the invalid XML. – jsalonen Nov 01 '12 at 14:36
  • user4815162342: can you please add that as an answer? It's a very valuable piece of advice! – jsalonen Nov 01 '12 at 14:38
  • @jsalonen Yup, I just did—after some digging to find what the feature is called in `re`-to-be. – user4815162342 Nov 01 '12 at 14:45
  • Note that my comment and answer should not be construed as condoning the use of regexps for general XML parsing. Even taking into account the abilities of modern regexp engines, parsing XML with regexps will be slow and incomplete at best, and completely broken at worst. – user4815162342 Nov 01 '12 at 14:48
  • I hear you. Even though regexps might be the wrong way to go, I appreciate the fact you described what the related problem in fact was. – jsalonen Nov 01 '12 at 14:50