Best way to parse invalid HTML in PHP

Question

Is there a better approach to parsing invalid HTML than running Tidy on it?

Side note: there are some situations where Tidy is not available. As I understand it, regular expressions are also not recommended for parsing HTML.

In situations where you don't have Tidy available, you should install it. Or you could just not use broken HTML in the first place. — Matti Virkkunen, Aug 31 '10 at 07:17
Are you serious? There have been at least a couple of times when I could not follow this best practice: clients supplying invalid HTML that needs to be parsed, or shared hosting with no option to install Tidy. — johnlemon, Aug 31 '10 at 07:20
possible duplicate of [Best methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html) — Gordon, Aug 31 '10 at 07:25

score 7 · Accepted Answer · answered Aug 31 '10 at 07:18

I would try something like this: http://php.net/manual/en/domdocument.loadhtml.php

From that page:

The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load. This function may also be called statically to load and create a DOMDocument object.
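A minimal sketch of that approach, assuming the DOM extension is enabled (the sample markup and variable names are made up for illustration). Parse warnings are collected through `libxml_use_internal_errors()` rather than printed, since broken markup usually triggers quite a few of them:

```php
<?php
// Deliberately broken markup: unclosed <p> and <b>, no <html>/<body> wrapper.
$html = '<div><p>First paragraph<p>Second <b>bold text</div>';

// Collect libxml parse warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Discard the collected warnings and restore the previous error mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Read the text of every <p> element the parser recovered.
$paragraphs = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    $paragraphs[] = trim($p->textContent);
}

print_r($paragraphs);

// Serialize the repaired tree back to markup.
echo $doc->saveHTML();
```

The parser closes the dangling tags itself, so the rest of the code can work with an ordinary DOM tree.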

Rob


It seems loadHTML objects when two or more elements share the same ID value (although this probably comes from libxml). – HorusKol Jan 06 '15 at 04:22
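To see what the parser complains about without those complaints surfacing as PHP warnings, something along these lines may help. This is only a sketch with made-up sample markup; whether a duplicate ID is reported at all, and with what wording, depends on the libxml version in use:

```php
<?php
// Two elements sharing one id value (invalid, but common in the wild).
$html = '<div id="dup">one</div><div id="dup">two</div>';

$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Gather whatever libxml reported; the list may be empty on some versions.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}

libxml_clear_errors();
libxml_use_internal_errors($previous);

// Both elements are still present in the tree, whatever was reported.
$divCount = $doc->getElementsByTagName('div')->length;
echo "Parsed {$divCount} div elements\n";

foreach ($messages as $message) {
    echo $message, "\n";
}
```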

score 1 · Answer 2 · answered Aug 31 '10 at 07:19

SimpleHTMLDOM is known to be more lenient than PHP's native DOM functions.

Pekka


Suggested third party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org). – Gordon Aug 31 '10 at 07:24
@Gordon this time you were too quick :) He is looking to parse broken HTML. – Pekka Aug 31 '10 at 07:25
which all DOM based parsers should be able to handle fine when using [libxml's HTML parser module](http://xmlsoft.org/html/libxml-HTMLparser.html). – Gordon Aug 31 '10 at 07:26
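As a rough illustration of querying broken markup through the native DOM extension (which is built on libxml), here is a sketch that uses `DOMXPath`. The sample markup and variable names are assumptions for the example, not taken from any of the libraries mentioned above:

```php
<?php
// Broken list markup: no closing </a> or </li> tags.
$html = '<ul><li><a href="/a">A<li><a href="/b">B</ul>';

$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

libxml_clear_errors();
libxml_use_internal_errors($previous);

// Query the recovered tree for every link that carries an href attribute.
$xpath = new DOMXPath($doc);
$hrefs = [];
foreach ($xpath->query('//a[@href]') as $anchor) {
    $hrefs[] = $anchor->getAttribute('href');
}

print_r($hrefs);
```

The recovered tree may nest elements differently from what a browser would build, so queries that match on tag names and attributes tend to be more robust than ones that depend on exact parent/child structure.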
