(If you're impatient, just skip to the Summary section at the bottom)

It is commonly expressed here on Stack Overflow and in the developer community that trying to parse HTML with regular expressions ("regexes") is a bad idea. To quote Jeff Atwood of Coding Horror:

So, while I may attempt to parse HTML using regular expressions in certain situations, I go in knowing that:

  • It's generally a bad idea.
  • Unless you have discipline and put very strict conditions on what you're doing, matching HTML with regular expressions rapidly devolves into madness, just how Cthulhu likes it.
  • I had what I thought to be good, rational, (semi) defensible reasons for choosing regular expressions in this specific scenario.

Reasons Why Regexes Are Bad for HTML Parsing

The reasons given seem to fall into these categories:

  1. You can't use them to parse arbitrary HTML, because there are known cases where a regular expression won't work.

  2. Regexes don't handle invalid HTML properly (is this just an example of point #1 above?).

  3. HTML is a "Chomsky Type 2 grammar (context free grammar)", while regular expressions are a "Chomsky Type 3 grammar (regular grammar)".
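To make point 3 concrete, here is a minimal Ruby sketch of what goes wrong with nesting. The input strings are made up for illustration, and it only covers plain (non-recursive) patterns; some regex engines have extensions that go beyond regular grammars.

```ruby
# Hypothetical input: a <div> nested inside another <div>.
nested = "<div>outer <div>inner</div> tail</div>"

# A lazy pattern stops at the FIRST closing tag, cutting the outer element short.
lazy = nested[/<div>.*?<\/div>/m]

# Hypothetical input: two sibling <div> elements.
siblings = "<div>a</div><div>b</div>"

# A greedy pattern runs to the LAST closing tag, merging both siblings into one match.
greedy = siblings[/<div>.*<\/div>/m]

puts lazy    # <div>outer <div>inner</div>
puts greedy  # <div>a</div><div>b</div>
```

Neither quantifier can pair each opening tag with its own closing tag, because that requires counting nesting depth.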

Sometimes Regexes Are OK for HTML Parsing?

However, people have also mentioned that in some cases, it's okay to parse a limited set of known HTML:

[I]t's sometimes appropriate to parse a limited, known set of HTML.

Jeff Atwood argues:

I think that's just as wrongheaded as demanding every trivial HTML processing task be handled by a full-blown parsing engine. It's more important to understand the tools, and their strengths and weaknesses, than it is to knuckle under to knee-jerk dogmatism.

I Don't Get It :(

I've never understood in which circumstances it's "appropriate" to parse HTML using a regex, as the two quotes above suggest. I guess it's because I don't really understand the situations where regexes don't really work:

  1. So apparently regexes don't work when the HTML isn't even valid, is that right?

  2. What if you can expect your input HTML to always be valid? Is it ok to parse it with regexes then?

Yes, I've seen this Stack Overflow question with examples already. No, the answers don't really help...this one, in particular, lacks explanation.

I'm bringing this question up now because I've been reading some of the source code for Ruby ERB and jQuery, and they use regexes to parse HTML strings! So why do they use regexes instead of an HTML parser? Why do regexes not lead to some kind of incorrect behavior in these cases?

Ruby ERB Source Code

So here's the source code from ERB that's using regex to parse templates:

def scan_line(line)
  line.scan(/(.*?)(<%%|%%>|<%=|<%#|<%|%>|\n|\z)/m) do |tokens|
    tokens.each do |token|
      next if token.empty?
      yield(token)
    end
  end
end

I've tested this out using the code below, and sure enough, scan_line correctly tokenizes the template, parsing out HTML and ERB tags:

t = <<TEMPLATE
<div>
  <% cupcakes.each do |c| %>
    <p>Oh boy, another cupcake!</p>
    <ul>
      <li>Flavor: <%= c.flavor %></li>
      <li>Price: <%= c.price %></li>
    </ul>
  <% end %>
</div>
TEMPLATE

t.split("\n").each do |line|
  scan_line(line) { |token| puts token }
end

This produces the following output:

<div>

<%
 cupcakes.each do |c|
%>
    <p>Oh boy, another cupcake!</p>
    <ul>
      <li>Flavor:
<%=
 c.flavor
%>
</li>
      <li>Price:
<%=
 c.price
%>
</li>
    </ul>

<%
 end
%>
</div>

jQuery Source Code

Here's the regex in jQuery's source code:

define(function() {
  // Match a standalone tag
  return (/^<(\w+)\s*\/?>(?:<\/\1>|)$/);
});

I've tested this out in my browser console, and it seems that it will only match plain HTML tags, i.e. tags without attributes or text content. For example:

/^<(\w+)\s*\/?>(?:<\/\1>|)$/.exec('<p>Hello!</p>');
// null

/^<(\w+)\s*\/?>(?:<\/\1>|)$/.exec('<img src="foo.jpg"/>');
// null

/^<(\w+)\s*\/?>(?:<\/\1>|)$/.exec('<img/>');
// ["<img/>", "img"]

/^<(\w+)\s*\/?>(?:<\/\1>|)$/.exec('<img>');
// ["<img>", "img"]

/^<(\w+)\s*\/?>(?:<\/\1>|)$/.exec('<div></div>')
// ["<div></div>", "div"]

TL;DR Summary

The Ruby ERB and jQuery source code above use regexes to parse HTML strings! So why do they use regexes instead of an HTML parser? Why do regexes not lead to some kind of incorrect behavior in these cases?

If you can expect your input HTML to always be valid, is it then ok to parse it with regexes?

Community

4 Answers

As is stated by Casper in the comments, ERB is processing its own language with its own parsing rules, not HTML, so that's a red herring. Similarly, jQuery in the example you give is not trying to parse general HTML, just a tiny subset of it.

There are a couple of situations where the use of regex is appropriate. If you can throw away everything you know about the syntax and structure of HTML and treat the input as a simple text file, then a regex can work.

The other thing to take into account is the consequences of errors. If you run a regex over large numbers of random HTML files for, say, sampling purposes, you will get some false positive and some false negative matches. But if most potential matches are correct, that may give you the output you need to a sufficient degree of accuracy.

Which brings us back to jQuery. The HTMLish strings that the sample code is processing are only consumed by jQuery. So the match will either work or fail. If it fails it will be obvious to the developer of the client code because it won't do what the developer intends it to do. The same does not apply to general HTML. The author of the HTML will have tested in browsers, which use a parser, not regex, and established that it does what the author wants in that context. If your code is processing it in a different way, you are taking on all the risk for the false positives and false negatives.
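As a rough illustration of such a false positive, here is a small Ruby sketch. The markup and the link-matching pattern are invented for this example; the point is only that a regex sees text that a browser's parser would treat as a comment.

```ruby
# Hypothetical page fragment: one link is commented out, one is live.
html = <<~HTML
  <!-- <a href="/old">retired link</a> -->
  <a href="/new">current link</a>
HTML

# A naive pattern that collects href values from anchor tags.
hrefs = html.scan(/<a\s+href="([^"]*)"/).flatten

p hrefs  # ["/old", "/new"]
```

A browser would only ever expose the `/new` link, so `/old` is a false positive that the HTML's author never had a reason to notice.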

To briefly address your final question, validity is irrelevant.
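A short Ruby sketch of why validity doesn't help, assuming a typical naive tag pattern (the pattern and the sample tag are made up for illustration). A `>` inside a double-quoted attribute value is perfectly valid HTML, yet it ends the match early:

```ruby
# Valid HTML: the alt text contains a literal ">" inside a quoted attribute value.
valid_html = '<img alt="2 > 1" src="math.png">'

# Naive pattern: "everything up to the next >" is assumed to be the whole tag.
naive_tag = /<img[^>]*>/

match = valid_html[naive_tag]
puts match  # <img alt="2 >
```

The input is valid and the regex still returns a truncated tag, with the `src` attribute lost.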

And incidentally, I doubt that a full-blown HTML parsing engine is any more complex than a full-blown regular expression engine. It's just that sometimes a regex engine is closer to hand.

One other point. It's worth taking into account the social context here. Often we see people turn up on Stack Overflow, saying something like "I'm trying to process some HTML with my regex, it's not working and I'm stuck, how can I fix it?" The fact that you're stuck is a big clue that you should be using a parser.

Alohci
  • Your point about general HTML being validated against a parser vs regex is very interesting. I'm a little confused though about what you mean by "throwing away everything you know about the syntax and structure of HTML" in such a way that a regex will work. –  Mar 24 '14 at 06:44
  • It's hard to pin down exactly what that means in every situation, but look at it like this. An HTML document is made up of markup and content. If you can construct your regex such that it's measuring the content and not the markup, then its use may be appropriate. If your regex is full of angle brackets and single and double quotes then you're on to a loser. – Alohci Mar 24 '14 at 07:36

Answer based on our discussion above:

ERB is not parsing HTML. It's parsing ERB. There's a big difference there.

ERB looks structurally similar to HTML though, why is it different? – Cupcake

I think you might be confusing pattern matching with parsing. Pattern matching simple HTML constructs is in general OK when you need to do a simple task quickly. Most of your examples fall more into the pattern matching category. But parsing is another thing.

Parsing means building a coherent data structure for some predefined language by means of lexical and contextual analysis. When people talk about parsing HTML with regexes, that is what they are commonly understood to be attempting.

It's a very complex process because HTML is complex. ERB is not complex, ERB is simple. Therefore ERB can be "parsed" by just utilizing simple pattern matching rules. That's the difference.
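To sketch the distinction in Ruby: below, the regex only does the pattern matching (finding tags), while a stack does the part that counts as parsing (tracking which element is nested in which). This assumes a toy markup with only `<name>` and `</name>` tags, no attributes, comments, or void elements; `nesting_depths` is a hypothetical helper, not part of any library.

```ruby
# Returns [tag_name, depth] pairs for each opening tag in a toy markup string.
# Assumption: input contains only <name> and </name> tags (no attributes, no text rules).
def nesting_depths(markup)
  stack = []   # currently open elements, innermost last
  depths = []

  # The regex only recognizes individual tags; it knows nothing about nesting.
  markup.scan(/<(\/?)(\w+)>/) do |closing, name|
    if closing.empty?
      stack.push(name)
      depths << [name, stack.size]
    else
      # The stack is what checks that tags are properly paired.
      raise ArgumentError, "unexpected </#{name}>" unless stack.last == name
      stack.pop
    end
  end

  raise ArgumentError, "unclosed <#{stack.last}>" unless stack.empty?
  depths
end

p nesting_depths("<ul><li><b></b></li><li></li></ul>")
# [["ul", 1], ["li", 2], ["b", 3], ["li", 2]]
```

ERB tags don't nest, so ERB can get by with the tokenizing step alone; real HTML needs the second step, plus many rules this toy version ignores.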

Casper

I guess the main argument will be that DOM or HTML parsing can be done only with valid DOM or HTML input and a bug-free DOM / HTML parser library. I expect that jQuery in particular has to deal with such issues.

hek2mgl

ERb has absolutely nothing whatsoever to do with HTML. The ERb library parses ERb, not HTML. ERb has been specifically designed to be trivial to parse with Ruby's Regexps.

If ERb were using an HTML parser, then how could it parse database.yml, which is YAML, not HTML? How could it parse .js.erb, which is ECMAScript, not HTML?
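For instance, here is a minimal sketch of ERb rendering a YAML-style template with Ruby's standard `erb` library. The template contents are made up, loosely modelled on a database.yml file:

```ruby
require "erb"

# Hypothetical database.yml-style template; the keys and values are invented.
template = <<~YAML
  production:
    adapter: postgresql
    pool: <%= 2 + 3 %>
YAML

# ERb only looks for its own <% ... %> tags; everything else passes through untouched.
rendered = ERB.new(template).result
puts rendered
```

There is no HTML anywhere in the input, and ERb neither knows nor cares; it evaluates the one embedded expression and leaves the surrounding YAML as plain text.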

Jörg W Mittag
  • But ERb tags have a similar structure to HTML elements, so I thought it would have the same property of being difficult to parse with regex, is that not the case? –  Mar 24 '14 at 06:35