(If you're impatient, just skip to the Summary section at the bottom)
It is commonly expressed here on Stack Overflow and in the developer community that trying to parse HTML with regular expressions ("regexes") is a bad idea. To quote Jeff Atwood of Coding Horror:
So, while I may attempt to parse HTML using regular expressions in certain situations, I go in knowing that:
- It's generally a bad idea.
- Unless you have discipline and put very strict conditions on what you're doing, matching HTML with regular expressions rapidly devolves into madness, just how Cthulhu likes it.
- I had what I thought to be good, rational, (semi) defensible reasons for choosing regular expressions in this specific scenario.
Reasons Why Regexes Are Bad for HTML Parsing
The reasons people give seem to fall into these categories:
1. You can't use regexes to parse arbitrary HTML, because there are known cases where a regular expression won't work.
2. Regexes don't handle invalid HTML properly (is this just an example of point 1 above?).
3. HTML is a "Chomsky Type 2 grammar (context free grammar)", while regular expressions are a "Chomsky Type 3 grammar (regular grammar)".
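As I understand the last point, the practical consequence is that a regex can't keep track of how deeply tags are nested. Here's a minimal JavaScript sketch of that, using two deliberately naive patterns I made up for illustration (they aren't taken from any library):

```javascript
// Two naive attempts at "match a <div> and its contents".
// Both patterns are hypothetical, written only for this example.
const lazyDiv = /<div>([\s\S]*?)<\/div>/;   // stops at the FIRST </div>
const greedyDiv = /<div>([\s\S]*)<\/div>/;  // runs to the LAST </div>

// Nested input: the lazy pattern closes the outer div too early.
const nested = '<div>outer <div>inner</div> tail</div>';
const lazyMatch = lazyDiv.exec(nested);
console.log(lazyMatch[0]);
// <div>outer <div>inner</div>

// Sibling input: the greedy pattern swallows both divs as one.
const siblings = '<div>a</div><div>b</div>';
const greedyMatch = greedyDiv.exec(siblings);
console.log(greedyMatch[1]);
// a</div><div>b
```

Neither pattern counts opening and closing tags, so each one is right for one input shape and wrong for the other.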
Sometimes Regexes Are OK for HTML Parsing?
However, people have also mentioned that in some cases, it's okay to parse a limited set of known HTML:
[I]t's sometimes appropriate to parse a limited, known set of HTML.
I think that's just as wrongheaded as demanding every trivial HTML processing task be handled by a full-blown parsing engine. It's more important to understand the tools, and their strengths and weaknesses, than it is to knuckle under to knee-jerk dogmatism.
I Don't Get It :(
I've never understood in which circumstances it's "appropriate" to parse HTML using a regex, as the two quotes above suggest. I guess that's because I don't fully understand the situations where regexes break down:
So apparently regexes don't work when the HTML isn't even valid, is that right?
What if you can expect your input HTML to always be valid? Is it ok to parse it with regexes then?
Yes, I've seen this Stack Overflow question with examples already. No, the answers don't really help...this one, in particular, lacks explanation.
I'm bringing this question up now because I've been reading some of the source code for Ruby ERB and jQuery, and they use regexes to parse HTML strings! So why do they use regexes instead of an HTML parser? Why do regexes not lead to some kind of incorrect behavior in these cases?
Ruby ERB Source Code
So here's the source code from ERB that's using regex to parse templates:
# Split a template line into tokens: plain text and ERB delimiters.
def scan_line(line)
  line.scan(/(.*?)(<%%|%%>|<%=|<%#|<%|%>|\n|\z)/m) do |tokens|
    tokens.each do |token|
      next if token.empty? # skip empty captures
      yield(token)
    end
  end
end
I've tested this out using the code below, and sure enough, scan_line correctly tokenizes the template, separating the HTML from the ERB tags:
t = <<TEMPLATE
<div>
<% cupcakes.each do |c| %>
<p>Oh boy, another cupcake!</p>
<ul>
<li>Flavor: <%= c.flavor %></li>
<li>Price: <%= c.price %></li>
</ul>
<% end %>
</div>
TEMPLATE
t.split("\n").each do |line|
  scan_line(line) { |token| puts token }
end
This produces the following output:
<div>
<%
cupcakes.each do |c|
%>
<p>Oh boy, another cupcake!</p>
<ul>
<li>Flavor:
<%=
c.flavor
%>
</li>
<li>Price:
<%=
c.price
%>
</li>
</ul>
<%
end
%>
</div>
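One thing I notice in that output is that the HTML itself is never broken apart: only the ERB delimiters are. To check that, here's a rough JavaScript approximation of the same idea (JavaScript so it can sit next to the jQuery examples below). The names `erbDelimiters` and `scanLine` are my own, and it uses `split` instead of Ruby's `scan`, so treat it as a sketch rather than a faithful port:

```javascript
// Same delimiter alternatives as the ERB regex, minus Ruby's \z anchor.
const erbDelimiters = /(<%%|%%>|<%=|<%#|<%|%>|\n)/;

// Hypothetical helper: split a line on ERB delimiters, keeping the
// delimiters themselves (the capture group) and dropping empty strings.
function scanLine(line) {
  return line.split(erbDelimiters).filter((token) => token !== '');
}

console.log(scanLine('<li>Flavor: <%= c.flavor %></li>'));
// [ '<li>Flavor: ', '<%=', ' c.flavor ', '%>', '</li>' ]

// Badly nested HTML passes through untouched as a single text token.
console.log(scanLine('<p>Oh boy, <b>another</p> cupcake!'));
// [ '<p>Oh boy, <b>another</p> cupcake!' ]
```

So at least in this sketch, the regex is looking for a handful of fixed delimiter strings and treating everything else, HTML included, as opaque text.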
jQuery Source Code
Here's the regex in jQuery's source code:
define(function() {
// Match a standalone tag
return (/^<(\w+)\s*\/?>(?:<\/\1>|)$/);
});
I've tested this out in my browser console, and it seems that it will only match plain HTML tags, i.e. tags with no attributes and no text content. For example:
/^<(\w+)\s*\/?>(?:<\/\1>|)$/.exec('<p>Hello!</p>');
// null
/^<(\w+)\s*\/?>(?:<\/\1>|)$/.exec('<img src="foo.jpg"/>');
// null
/^<(\w+)\s*\/?>(?:<\/\1>|)$/.exec('<img/>');
// ["<img/>", "img"]
/^<(\w+)\s*\/?>(?:<\/\1>|)$/.exec('<img>');
// ["<img>", "img"]
/^<(\w+)\s*\/?>(?:<\/\1>|)$/.exec('<div></div>');
// ["<div></div>", "div"]
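To check my reading of this regex, here's a small sketch that wraps it in a helper. The name `classifyMarkup` and the shape of the returned object are hypothetical, made up for this example; I'm not claiming this is how jQuery uses the pattern internally.

```javascript
// Same pattern as the jQuery snippet above.
const rsingleTag = /^<(\w+)\s*\/?>(?:<\/\1>|)$/;

// Hypothetical helper: report whether a string is a lone, attribute-free
// tag (optionally followed by its own closing tag) or something richer.
function classifyMarkup(html) {
  const match = rsingleTag.exec(html);
  if (match) {
    return { kind: 'single-tag', tagName: match[1] };
  }
  return { kind: 'needs-parser', tagName: null };
}

console.log(classifyMarkup('<img>').kind);         // single-tag
console.log(classifyMarkup('<div></div>').kind);   // single-tag
console.log(classifyMarkup('<div></span>').kind);  // needs-parser
console.log(classifyMarkup('<p>Hello!</p>').kind); // needs-parser
```

Read this way, the regex isn't trying to parse HTML in general. It only recognizes one very narrow shape, and the `\1` backreference makes it reject a closing tag that doesn't match the opening one.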
TL;DR Summary
The Ruby ERB and jQuery source code above use regexes to parse HTML strings! So why do they use regexes instead of an HTML parser? Why do regexes not lead to some kind of incorrect behavior in these cases?
If you can expect your input HTML to always be valid, is it then ok to parse it with regexes?