30

I was just reviewing a previous post of mine and noticed a number of people suggesting that I shouldn't use regex to parse XML. In that case the XML was relatively simple, and regex didn't pose any problems. I was also parsing a number of other code formats, so for the sake of uniformity it made sense. But I'm curious how this might pose a problem in other cases. Is this just a 'don't reinvent the wheel' type of issue?

yatakaka
  • Maybe because there are already thousands of XML parsers, including parsers _built into_ programming languages and frameworks such as GTK. – ApprenticeHacker Dec 20 '11 at 14:36
  • 2
    @Michael waiting for the link. – ApprenticeHacker Dec 20 '11 at 14:37
  • 4
    You can use regex for extracting bits of information from small, predictable, restricted snippets of XML, no problem, but regex is not meant for **parsing** XML as a whole. It's like using a ball-peen hammer to peel an orange. – BoltClock Dec 20 '11 at 14:37
  • 2
    It actually is a good question - it would be good to have a definitive answer here, which could be referred to whenever there are questions regarding parsing XML with regular expressions... – Avi Dec 20 '11 at 14:38
  • 2
    This answer is about parsing HTML, but nevertheless insightful: http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491 – martin clayton Dec 20 '11 at 14:51
  • 1
    Rather: Why is it such a bad idea to search the forum before asking a question? – ThomasRS Dec 20 '11 at 14:55
  • 3
    The best answer is, http://stackoverflow.com/a/1732454/135078 (Beware Zalgo) – Kelly S. French Jan 12 '12 at 22:38

3 Answers

50

The real trouble is nested tags, which are very difficult to handle with regular expressions. It's possible with balanced matching, but that's only available in .NET, and only a few other flavors (PCRE and Perl, for example) offer recursion as a comparable feature. Even with the power of balanced matching, an ill-placed comment could still throw off the regular expression.

For example, this is a tricky one to parse...

<div>
    <div id="parse-this">
        <!-- oops</div> -->
        try to get this value with regex
    </div>
</div>

You could be chasing edge cases like this for hours with a regular expression, and maybe find a solution. But really, there's no point when there are specialized XML, XHTML, and HTML parsers out there that do the job more reliably and efficiently.
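To make the comment problem concrete, here is a minimal sketch in Python (my choice of language, not part of the original answer). The function names are made up for illustration, and the regex is just one naive attempt, not the only pattern someone might try. It runs the snippet above through a lazy regex and through the standard library's `xml.etree.ElementTree` parser.

```python
import re
import xml.etree.ElementTree as ET

# The snippet from the answer above, unchanged.
DOC = """<div>
    <div id="parse-this">
        <!-- oops</div> -->
        try to get this value with regex
    </div>
</div>"""


def extract_with_regex(doc: str) -> str:
    """Naive attempt: grab everything up to the first closing div tag."""
    match = re.search(r'<div id="parse-this">(.*?)</div>', doc, re.DOTALL)
    return match.group(1).strip() if match else ""


def extract_with_parser(doc: str) -> str:
    """Let a real XML parser find the element; comments are skipped by default."""
    root = ET.fromstring(doc)
    target = root.find(".//div[@id='parse-this']")
    return "".join(target.itertext()).strip()


if __name__ == "__main__":
    print(repr(extract_with_regex(DOC)))   # '<!-- oops'
    print(repr(extract_with_parser(DOC)))  # 'try to get this value with regex'
```

The regex stops at the `</div>` inside the comment because it has no idea what a comment is. You can patch the pattern to skip comments, but then CDATA sections and entities are waiting in line.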

Steve Wortham
  • 1
    You should throw in some numeric character entities or DTD-defined entities just to make it harder :-p. – binki Oct 29 '14 at 19:54
9

This has been discussed so many times here on SO. See e.g.

Can you provide some examples of why it is hard to parse XML and HTML with a regex?

Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms

Just follow the links on the right side of the screen to more answers.

My conclusion:

Simple: a regular expression is not a parser, it's a tool for finding patterns.

If you want to find a very specific pattern in a (ht|x)ml file, go on, regex is perfect for that.

But if you are searching for something in every Foo tag, where the tags can have attributes in different orders, can be nested, and can be written in many different (yet still valid) ways, then use a parser, because that's not pattern matching anymore.
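As a rough illustration of the attribute-order point, here is a small Python sketch. The `<items>` document, the `id`/`name` attributes, and the function names are all invented for the example. The regex hard-codes one spelling of the tag, while the parser only cares about what the tag means.

```python
import re
import xml.etree.ElementTree as ET

# Three equivalent ways to write a Foo element (hypothetical sample data).
DOC = """<items>
  <Foo id="1" name="a"/>
  <Foo name='b' id='2'/>
  <Foo
      id = "3"
      name = "c" />
</items>"""

# A pattern that assumes one attribute order, one quote style, one layout.
PATTERN = re.compile(r'<Foo id="(\d+)" name="(\w+)"/>')


def names_with_regex(doc: str) -> list[str]:
    """Collect the name attribute of every Foo the pattern recognizes."""
    return [name for _id, name in PATTERN.findall(doc)]


def names_with_parser(doc: str) -> list[str]:
    """Collect the name attribute of every Foo element in document order."""
    return [foo.get("name") for foo in ET.fromstring(doc).iter("Foo")]


if __name__ == "__main__":
    print(names_with_regex(DOC))   # ['a']
    print(names_with_parser(DOC))  # ['a', 'b', 'c']
```

All three `Foo` elements are valid XML and mean the same kind of thing, but only the first one happens to be written the way the pattern expects.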

Community
  • 1
  • 1
stema
  • XPath is sort of regex for XML. The problem is that regexes don't understand recursion. – AK_ Oct 22 '13 at 21:19
  • 2
    @AK_ XPath is not a sort of regex. *[XPath](http://en.wikipedia.org/wiki/XPath) is a query language for selecting nodes from an XML document*. That has nothing to do with regex. And I doubt that you have understood my answer. The problem is not that regexes don't understand recursion; they do: [see regular-expressions.info](http://www.regular-expressions.info/recurse.html). The problem is that (ht|x)ml can look so different but have the same result. With a lot of effort [you can parse (ht|x)ml with regex](http://stackoverflow.com/a/4234491/626273), but an existing parser is much simpler to use. – stema Oct 23 '13 at 06:41
  • 1. What you are referring to are extensions. These are not regular expressions in the computer science sense. 2. Please read [this](http://en.wikipedia.org/wiki/Chomsky_hierarchy) and the background material; it's easy to formulate an XML document that would be impervious to regex. 3. XPath and XSD can be used **in practice** for some of the things that can be done with regex, like validation and looking for things in documents. They are similar in the... rhetorical sense :-) – AK_ Oct 23 '13 at 18:28
  • @AK_, I am talking about regexes as used in today's programming languages, not about regular languages as defined by the Chomsky hierarchy. As I understand it, regexes have not been regular since the introduction of backreferences, but that's not my topic, and in 99.99% of the questions here it is not the topic either. I totally agree with your point 2. That is what I have been trying to say the whole time. (Maybe I did not do a good job :-( ) – stema Oct 23 '13 at 19:29
6

XML is not a regular language (that's a technical term) so you will never be able to parse it correctly using a regular expression. You might be successful 99% of the time, but then someone will find a way of writing the XML that throws you.

If you're writing some kind of screen-scraper then a 99% success rate might be adequate. For most applications, it isn't.
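Here is one way to see the "99% of the time" effect, sketched in Python with made-up sample documents and helper names. A plain pattern has to choose between lazy and greedy matching, and each choice breaks on one of two perfectly ordinary inputs, while a parser handles both.

```python
import re
import xml.etree.ElementTree as ET

# Two small, well-formed documents (hypothetical sample data).
NESTED = "<r><a>outer<a>inner</a>tail</a></r>"
SIBLINGS = "<r><a>one</a><a>two</a></r>"

LAZY = re.compile(r"<a>(.*?)</a>")   # stops at the first closing tag
GREEDY = re.compile(r"<a>(.*)</a>")  # runs to the last closing tag


def first_a_content(pattern: re.Pattern, doc: str):
    """Return what the pattern thinks is inside the first <a> element."""
    match = pattern.search(doc)
    return match.group(1) if match else None


def first_a_text(doc: str) -> str:
    """Return all text inside the first <a> child, using a real parser."""
    first = ET.fromstring(doc).find("a")
    return "".join(first.itertext())


if __name__ == "__main__":
    print(first_a_content(LAZY, NESTED))      # outer<a>inner  (cut off too early)
    print(first_a_content(GREEDY, SIBLINGS))  # one</a><a>two  (ran too far)
    print(first_a_text(NESTED))               # outerinnertail
    print(first_a_text(SIBLINGS))             # one
```

The lazy pattern is right for the siblings and wrong for the nested case; the greedy one is the other way round. Picking correctly requires counting open tags, which is exactly what a classic regular expression cannot do.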

Michael Kay
  • 2
    Regular expressions were initially designed to handle regular languages only, but modern implementations include lookarounds, backreferences, and sometimes balanced matching. That allows you to venture into slightly more complex languages... But it still isn't sufficient for something as complex as XML or HTML. – Steve Wortham Dec 20 '11 at 17:22
  • 3
    I've never seen an attempt to parse XML using a regex that won't break on some content (e.g. something suitably XML-like inside a comment or CDATA section). So the only acceptable situation for using a regex is where you don't mind if it doesn't always work. – Michael Kay Dec 24 '11 at 00:12
  • I agree. I only wanted to mention the whole regular language thing because I once made the same argument, and then later realized my mistake. – Steve Wortham Dec 24 '11 at 00:48
  • Natural language in isolation is barely regular enough. Even on something as [theoretically isolatable as "tag split" or "search term split"](https://github.com/marrow/util/blob/develop/marrow/util/convert.py?ts=4#L191-L199). Using those two as examples: `r'[\s \t,]*("[^"]+"|\'[^\']+\'|[^ \t,]+)[ \t,]*'` and `r'[\s \t]*([+-]?"[^"]+"|\'[^\']+\'|[^ \t]+)[ \t]*'` respectively. I throw up a little in my mouth thinking about the fact I wrote a generator for these abominations. ;^P And this is still (extremely) fragile to quote balances! – amcgregor Jun 17 '19 at 13:58