I have the following regex from this post:

\<([\w]+)([^\>]*?)(([\s]*\/\>)|(\>((([^\<]*?|\<\!\-\-.*?\-\-\>)|(?R))*)\<\/\1[\s]*\>))

This regex is meant to match any HTML element, including nested ones.

When I test it on RegExr and regex101 it works fine.

However when I test it with the following code in Python...

re.finditer('\<([\w]+)([^\>]*?)(([\s]*\/\>)|(\>((([^\<]*?|\<\!\-\-.*?\-\-\>)|(?R))*)\<\/\1[\s]*\>))', data):

... I get this error: unexpected end of pattern.

Does anyone know how to fix this?

David Callanan
  • `(?R)` is not a supported extension in the Python regex engine. Had you selected the Python engine in regex101 you'd have seen this earlier. – Martijn Pieters Apr 06 '18 at 08:39
  • Is there any library that does? I thought I saw someone else using this in Python. – David Callanan Apr 06 '18 at 08:39
  • The linked post says "Not that I suggest using it, but..." while others correctly say that you should use an HTML parser and not regex. – Alex Hall Apr 06 '18 at 08:40
  • If you press the `python` button on `regex101` it shows *"(? Incomplete group structure ) Incomplete group structure"* error. – Ken Y-N Apr 06 '18 at 08:41
  • [You cannot parse HTML with regex](https://stackoverflow.com/a/1732454/9478968) –  Apr 06 '18 at 08:47

1 Answer

The pattern uses (?R), a construct that recurses into the whole pattern, and the Python re module does not support it.

You'd have to install the regex project instead, which does support it.
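To see the difference between the two modules, here is a minimal sketch. It uses a much smaller recursive pattern (balanced parentheses) than the one in the question, and the helper name `stdlib_accepts` is made up for illustration. The second half only runs if the third-party regex package happens to be installed.

```python
import re

# A small recursive pattern: balanced parentheses.
# (?R) asks the engine to re-enter the whole pattern at that point.
PATTERN = r'\((?:[^()]|(?R))*\)'


def stdlib_accepts(pattern):
    """Return True if the standard-library re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# The stdlib engine refuses the pattern because of the (?R) group.
print(stdlib_accepts(PATTERN))

try:
    import regex  # third-party package: pip install regex
except ImportError:
    regex = None  # not installed; skip the recursive demo

if regex is not None:
    # The regex module compiles the same pattern and follows the recursion.
    match = regex.search(PATTERN, 'f(a(b)c) tail')
    print(match.group(0))
```

The exact error message from re varies between Python versions, so the sketch only checks that compilation fails.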

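Also, you should use an r'...' raw string literal, so that Python doesn't interpret the backslashes itself. Most of the escapes in this pattern, such as \< and \w, pass through a regular string literal unchanged, but \1 would be read as an octal escape (the character \x01), and the backreference would be lost: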

>>> import regex
>>> regex.compile(r'\<([\w]+)([^\>]*?)(([\s]*\/\>)|(\>((([^\<]*?|\<\!\-\-.*?\-\-\>)|(?R))*)\<\/\1[\s]*\>))')
regex.Regex('\\<([\\w]+)([^\\>]*?)(([\\s]*\\/\\>)|(\\>((([^\\<]*?|\\<\\!\\-\\-.*?\\-\\-\\>)|(?R))*)\\<\\/\\1[\\s]*\\>))', flags=regex.V0)
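A quick way to check what a raw string buys you is to compare the two spellings of a backreference. This is only a sketch with a toy pattern, not the one from the question. In a regular string literal Python turns \1 into the single character \x01 before the regex engine ever sees it.

```python
import re

plain = '(a)\1'   # regular literal: \1 becomes the character \x01
raw = r'(a)\1'    # raw literal: backslash and digit reach the regex engine

print(len(plain), len(raw))      # the plain string is one character shorter
print(plain[-1] == '\x01')       # the last character is the octal escape

# Only the raw version behaves as a backreference to group 1.
print(re.fullmatch(raw, 'aa') is not None)
print(re.fullmatch(plain, 'aa') is not None)
print(re.fullmatch(plain, 'a\x01') is not None)
```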

However, if you are going to install a 3rd-party library, install BeautifulSoup instead, and use a proper HTML parser to parse HTML.
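For a feel of what the parser approach looks like, here is a rough sketch. It deliberately uses the standard library's html.parser rather than BeautifulSoup, so that it runs without installing anything. The class name `TagCollector`, the `VOID_ELEMENTS` set and the sample markup are all invented for illustration.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial list, enough for the demo).
VOID_ELEMENTS = {'br', 'hr', 'img', 'input', 'link', 'meta'}


class TagCollector(HTMLParser):
    """Record every start tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((self.depth, tag, dict(attrs)))
        if tag not in VOID_ELEMENTS:
            self.depth += 1  # descend into the element

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS:
            self.depth = max(self.depth - 1, 0)  # climb back out


data = '<div class="a"><p>Hi <b>there</b><br/></p><!-- note --></div>'

collector = TagCollector()
collector.feed(data)
collector.close()

for depth, tag, attrs in collector.tags:
    print('  ' * depth + tag, attrs)
```

Nesting, comments and self-closing tags are handled by the parser itself, which is the part the recursive regex was trying to emulate.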

Martijn Pieters