Parsing HTML - is regex the only option in this case?

Question

A user will supply HTML, which may be valid or invalid (malformed). I need to be able to determine things such as:

1. Is there a style tag in the body?
2. Is there a div with a style attribute that uses width or background-image?

I have tried using the DOMDocument class but it can only do 1 and not 2 with xPath.

I have also tried simple_html_dom and that can only do 1 but not 2.

Do you think it's a good idea to just use regular expressions, or is there something I haven't thought of?

[No, not at all!](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Marcel Korpel, Jun 10 '11 at 15:21
Both of them can do 2, with the caveat that you'll have to parse the `style` attribute yourself. _That_ part you may use regex for. — Lightness Races in Orbit, Jun 10 '11 at 15:22
How malformed is it? There are HTML parsers, separate from XML parsers, which tend to accommodate the issues that prevent HTML from being parsed with an XML parser. — Maz, Jun 10 '11 at 15:23
Can someone give an XPath query example for the second option? @Tomalak - if that's the case, why don't I just use regex altogether, since I can do 1 without any problems? — Abs, Jun 10 '11 at 15:25
@Maz - can you name one? When I say malformed, I mean a user forgets a closing tag for example. — Abs, Jun 10 '11 at 15:25
@Abs: Because regexing a one-line `style` attribute is _completely_ different from regexing an HTML document. You might as well have asked me, "well if I can regex a `style` attribute, why can't I regex a giant hippopotamus?" It's a complete non sequitur. — Lightness Races in Orbit, Jun 10 '11 at 15:26
Wouldn't this XPath work for the second option: `//div[contains(@style,'width:') or contains(@style,'background-image:')`? — Austin Hyde, Jun 10 '11 at 15:31
@Austin - very interesting. How about if I want to see if a style block has a:hover or a font-size - can xpath deal with this? — Abs, Jun 10 '11 at 15:37
@Abs - First, note that I forgot the ending `]` on that XPath, second, XPath treats the entire HTML document as XML, not just the body, so searching a ` — Austin Hyde, Jun 10 '11 at 15:52
@Austin - do you think you can add an answer of what you just said. I think you're right, xPath should be able to work for both cases. I need to test it. — Abs, Jun 10 '11 at 15:55

Regexident · Answer 1 · 2011-06-10T15:33:59.260

3

Regex is NEVER (again: NEVER!) a solution for parsing HTML!

Regex can be used for Type-3 Chomsky languages (regular language).
HTML however is a Type-2 Chomsky language (context-free language).

If still in doubt: http://en.wikipedia.org/wiki/Chomsky_hierarchy#The_hierarchy

To work safely with a type-2 language you need a context-free language parser. You might want to try an LL parser or a recursive descent parser, e.g.

That being said:

Match body with style:

<body\s+[^>]*style\s*=\s*["'][^"']*["'][^>]*>

Match div with width|background-image in style:

<div\s+[^>]*style\s*=\s*["'][^"']*?(width|background-image)[^"']*?["'][^>]*>

They both falsely match said tags if commented out (which is why I said not possible).
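The false match on commented-out markup can be reproduced with a small PHP sketch. The function names and the sample markup are made up for illustration; the regex is the div pattern from above, and the DOM check assumes PHP's bundled DOM extension (libxml).

```php
<?php
// Sketch only: function names and sample markup are illustrative assumptions.

// Runs the div pattern from this answer against the raw HTML string.
function regexFindsStyledDiv(string $html): bool
{
    $pattern = '/<div\s+[^>]*style\s*=\s*["\'][^"\']*?(width|background-image)[^"\']*?["\'][^>]*>/';
    return preg_match($pattern, $html) === 1;
}

// Parses the HTML first, so commented-out markup becomes a comment node
// rather than an element.
function domFindsStyledDiv(string $html): bool
{
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return (new DOMXPath($doc))->query('//div[@style]')->length > 0;
}

// The only div in this document is commented out.
$html = '<html><body><!-- <div style="width: 10px">old</div> --><p>hi</p></body></html>';
var_dump(regexFindsStyledDiv($html));
var_dump(domFindsStyledDiv($html));
```

The regex sees only characters, so it reports the commented-out div; the parser-based check does not, because the comment never becomes an element in the tree.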

edited Jun 10 '11 at 15:33

answered Jun 10 '11 at 15:24

Regexident


@Regexident lol I get it and I was aware of this part I just couldn't find a solution for 2 when making use of the PHP dom class. Ok, can you name something that I can use in PHP? – Abs Jun 10 '11 at 15:27

Yes, but don't **EVER** make generalisations like that! See [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1733489#1733489). – Lightness Races in Orbit Jun 10 '11 at 15:29
@tomalak-geretkal: Well, If you're using regex to just parse a very restricted subset of your HTML code, then you're fine with Regex. But then again: you're not dealing with HTML anymore. You're instead dealing with a type-3-compatible subset of HTML. Which is a totally different thing in respect of language theory. – Regexident Jun 10 '11 at 15:36
@Abs: try an HTML DOM-tree parser, such as libxml, or whatever you find for PHP. – Regexident Jun 10 '11 at 15:38
@Regexident: I know. Theory aside, in the real world where there are actual problems to solve in order to meet contracts, that's sometimes perfectly fine. – Lightness Races in Orbit Jun 10 '11 at 15:39
I disagree that it's *never* a solution. If you're doing something very, very limited, it can be useful. It can also be used in combination with something else. – Justin Morgan - On strike Jun 10 '11 at 15:44
@justin-morgan: To be able to securely use regex with HTML/XML/type-2 (even for "something very, very limited") you'd basically have to introduce very strict limitations on what language patterns you allow in your HTML/type-2 code, after which you're basically not dealing with (real) HTML/type-2 anymore. And regarding your second point on using regex in "combination" with something else: this of course is totally fine. Most lexers use regex extensively. Lexers alone can't parse HTML though. Parsers can. Maybe I should have said "Regex (alone) is never a solution…". Point remains the same though. – Regexident Jun 11 '11 at 12:28

score 2 · Accepted Answer · answered Jun 10 '11 at 16:38


XPath can do both (1) and (2):

To test if there's a style tag in the body:

//body//style

To test if there's a div with a style attribute using width or background-image:

//div[contains(@style,'width:') or contains(@style,'background-image:')]

And, as you were curious about in your comments, seeing if a style tag contains a:hover or font-size:

//style[contains(text(),'a:hover') or contains(text(),'font-size:')]
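For reference, here is a minimal sketch of running these three queries from PHP with `DOMDocument` and `DOMXPath`. The helper name `runChecks`, the result keys, and the sample markup are assumptions for illustration, not something from the question; it also assumes that libxml leaves a `style` element inside `body` where it found it.

```php
<?php
// Sketch only: helper name, result keys and sample markup are illustrative assumptions.

function runChecks(string $html): array
{
    // Collect libxml parse warnings instead of emitting them for malformed input.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $queries = [
        'styleInBody'  => '//body//style',
        'divWithStyle' => "//div[contains(@style,'width:') or contains(@style,'background-image:')]",
        'styleRule'    => "//style[contains(text(),'a:hover') or contains(text(),'font-size:')]",
    ];

    // A check passes when its query selects at least one node.
    $results = [];
    foreach ($queries as $name => $query) {
        $results[$name] = $xpath->query($query)->length > 0;
    }
    return $results;
}

// The closing </div> is deliberately missing to mimic user-supplied markup.
$html = '<html><body><style>a:hover { color: red; }</style>'
      . '<div style="width: 100px">Hello</body></html>';
var_dump(runChecks($html));
```

Note that `contains(@style,'width:')` is a plain substring test: it also matches `max-width:` and misses `width :` with a space before the colon, so the attribute value may still need its own parsing step.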

Austin Hyde


You're right, XPath can do what I need. Thanks Austin for reading my question and not getting sucked into the argument of whether it's best to use regex for HTML parsing or not!! – Abs Jun 11 '11 at 08:23
@Abs - The reason you don't want to use regex for HTML parsing is because it can't handle the recursiveness inherent in it. If you just wanted to see if there was a div on the page with a width style or if there was a style tag in the body, that would be OK, but to actually pull information out of the document would require a real parser, not regex. – Austin Hyde Jun 11 '11 at 15:47

score 0 · Answer 3 · answered Jun 10 '11 at 15:27


You can use Tidy to clean up the HTML, then parse it as XML. Then it's easy to use xpath to find nodes. Try something like this:

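// Requires the PHP tidy extension.
$tidyConfig = array(
    "add-xml-decl" => true,
    "output-xml" => true,
    "numeric-entities" => true
);
$tidy = new tidy();
$tidy->parseString($html, $tidyConfig, "utf8");
$tidy->cleanRepair();
// Cast explicitly: SimpleXMLElement expects the repaired markup as a string.
$xml = new SimpleXMLElement((string) $tidy);
// Search the whole tree; a bare 'style' only matches direct children of the root.
$matches = $xml->xpath('//style');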

As for parsing a style attribute to look for specific properties, I think you'll have to do that manually. You could use a CSS parser if you want.

JW.


Tidy is not an option, I tried it and I find that it changes the original HTML and I need to keep the HTML as it is and parse it. So I need something that will allow me to parse invalid HTML. – Abs Jun 10 '11 at 15:30
What do you mean it changes the original? Any time you're parsing invalid HTML, there's going to be a chance that it interprets it differently than you intended. Not sure how you could avoid that. – JW. Jun 10 '11 at 15:35
`I find that it changes the original HTML` Yes, that's its job. What else were you expecting it to do? Have Tidy fix your input as a first step (your original data source will be unaffected) then continue. – Lightness Races in Orbit Jun 10 '11 at 15:41
@Abs: "parsing invalid HTML" is not possible. Parsing requires knowing what the input data looks like and what its structure and rules are. If your input is invalid (in an unpredictable way), then you have no grammar for it, and it cannot be parsed. It would be like trying to read a book in a language that you do not understand. – Lightness Races in Orbit Jun 10 '11 at 15:41
If you compare an invalid HTML document with the same document after it has been passed through Tidy, they will be two different documents. For example, if you have a style tag in the body, Tidy will move it into the head section. I do not want these sorts of changes to happen. – Abs Jun 10 '11 at 15:43

score 0 · Answer 4 · edited Jun 20 '20 at 09:12


It's rarely a good idea to parse HTML with regex. However, any good HTML parser will be able to find all the divs with style attributes, and regex could be useful for parsing those style attributes once you've done that.

It's still possible for complex (yet valid) CSS to break most regex, however, so the really durable thing here would be an HTML parser combined with a CSS parser. That could be overkill, though; a regex like \bwidth\s*:\s*(\w+) is likely to catch any width value unless someone's actively trying to fool it.
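One way to combine the two, sketched under assumptions: let `DOMDocument` find the divs, then run a small regex over each declaration in the `style` attribute. The function name `divsUsingProperties` and the sample markup are hypothetical, and the split on `;` is deliberately naive.

```php
<?php
// Sketch only: function name and sample markup are illustrative assumptions.

// Returns which of the requested CSS property names appear in any div's
// style attribute.
function divsUsingProperties(string $html, array $properties): array
{
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = [];
    foreach ((new DOMXPath($doc))->query('//div[@style]') as $div) {
        // Naive split on ";": fine for simple declarations, but it mis-splits
        // values that contain ";" themselves, e.g. inside url("...").
        foreach (explode(';', $div->getAttribute('style')) as $declaration) {
            // Capture the property name in front of the first colon.
            if (preg_match('/^\s*([a-z-]+)\s*:/i', $declaration, $m)) {
                $name = strtolower($m[1]);
                if (in_array($name, $properties, true)) {
                    $found[] = $name;
                }
            }
        }
    }
    return array_values(array_unique($found));
}

$html = '<html><body><div style="max-width: 10px; WIDTH : 5px">a</div>'
      . '<div style="color: red; background-image: url(a.png)">b</div></body></html>';
print_r(divsUsingProperties($html, ['width', 'background-image']));
```

Comparing whole property names means `max-width` is not mistaken for `width`, which a bare `\bwidth` pattern would do. A real CSS parser remains the sturdier choice for hostile input.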

Edit:

A good HTML parser won't choke on anything that wouldn't choke a browser. I'm not a PHP guy anymore, but I've heard some good things about HTML Purifier.

edited Jun 20 '20 at 09:12

Community


answered Jun 10 '11 at 15:33

Justin Morgan - On strike


To be honest, all I need to do is find the presence of elements and whether they have certain attributes. This should be fine for regex, right? In fact, I will be using the HTML as a string. – Abs Jun 10 '11 at 15:36
@Abs - Regex won't work for the whole HTML string. At least, it won't be even *close* to reliable, not without a ton of work. However, I'd go ahead and use the method I mentioned in the first paragraph: Use a DOM parser to grab the `style` attribute out of each div, then run regex on that. Believe me, it will be easier than doing the whole thing with regex. – Justin Morgan - On strike Jun 10 '11 at 15:39
@Abs - Edited to suggest a good HTML parser, if yours isn't cutting it. – Justin Morgan - On strike Jun 10 '11 at 15:49
