Parsing of badly formatted HTML in PHP

Question

In my code I convert a styled XLS document to HTML using OpenOffice, then parse the tables using xml_parser_create. The problem is that OpenOffice produces old-school HTML: it leaves <BR> and <HR> tags unclosed, emits no doctype, and doesn't quote attributes (e.g. <TABLE WIDTH=4>).

The PHP parsers I know of don't like this and report XML formatting errors. My current solution is to run some regexes over the file before I parse it, but this is neither nice nor fast.

Do you know of a PHP parser (ideally a bundled one) that doesn't care about these kinds of mistakes? Or perhaps a fast way to fix "broken" HTML?

score 9 · Accepted Answer · answered Feb 28 '10 at 15:40

One way to "fix" broken HTML is to use HTMLPurifier. Quoting its description:

HTML Purifier is a standards-compliant HTML filter library written in PHP.
HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant

An alternative is to load your HTML with DOMDocument::loadHTML. Quoting the manual:

The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load.

And if you're trying to load HTML from a file, see DOMDocument::loadHTMLFile.
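
As a rough illustration of the DOMDocument::loadHTML suggestion, the sketch below feeds a small OpenOffice-style fragment (no doctype, unquoted attributes, unclosed <BR>) to the parser and reads the table cells back. The sample markup and the helper name extractTableRows are made up for this example; they are not taken from the question.

```php
<?php
// Hypothetical sample resembling OpenOffice output: no doctype,
// unquoted attributes, unclosed <BR> tags.
$html = '<HTML><BODY><TABLE WIDTH=4 BORDER=1>'
      . '<TR><TD>Name<BR>first</TD><TD>Qty</TD></TR>'
      . '<TR><TD>Apple</TD><TD>3</TD></TR>'
      . '</TABLE></BODY></HTML>';

/**
 * Return the text of every table cell, grouped by row.
 * extractTableRows() is an illustrative helper, not a PHP built-in.
 */
function extractTableRows(string $html): array
{
    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    // The HTML parser lowercases element names, so look up 'tr' and 'td'.
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

print_r(extractTableRows($html));
```

Note that this builds the whole document tree in memory, so it trades memory for leniency compared with a streaming parser.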

Pascal MARTIN

+1 for introducing HTMLPurifier. One may also look at http://simplehtmldom.sourceforge.net/. – Alexar Feb 28 '10 at 16:43
The purifier is nice, but feels like overkill for this problem. The same goes for the DOM parser. Won't it need a lot more time and RAM than a simple SAX parser? – Thomas Ahle Mar 04 '10 at 22:16
Maybe it will require more RAM, and possibly more time, but it will do more than a simple SAX parser, which only reads data and does not repair it. I'd also say a SAX parser can only read valid XML, while HTMLPurifier and `DOMDocument::loadHTML` are both able to read "broken" HTML. – Pascal MARTIN Mar 04 '10 at 23:07
Because my errors are always generated by the same engine, and thus fairly predictable, I've coded the parser using simple regex. I know about http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454 and I am very thankful for pointing me to these two great tools. – Thomas Ahle Apr 04 '10 at 10:06
If you can "predict" the errors, I guess that's OK :-) You're welcome :-) – Pascal MARTIN Apr 04 '10 at 10:18
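
For readers curious what a regex pre-pass for predictable, generator-specific defects might look like, here is a minimal sketch. It is not the asker's actual code: the helper name fixKnownDefects and both patterns are assumptions, and it only handles the two defects named in the question (unquoted attribute values, unclosed <BR> and <HR>). It assumes attribute values never contain ">" or a whitespace-preceded "name=" sequence.

```php
<?php
/**
 * Illustrative pre-pass for predictable defects from a single generator.
 * fixKnownDefects() is a hypothetical helper, not a general HTML fixer.
 */
function fixKnownDefects(string $html): string
{
    // 1. Quote bare attribute values, but only inside opening tags:
    //    <TABLE WIDTH=4> becomes <TABLE WIDTH="4">.
    //    Values that are already quoted do not match the inner pattern.
    $html = preg_replace_callback(
        '/<[A-Za-z][^<>]*>/',
        static function (array $m): string {
            return preg_replace(
                '/(\s[\w:-]+)=([^\s"\'<>]+)/',
                '$1="$2"',
                $m[0]
            ) ?? $m[0];
        },
        $html
    ) ?? $html;

    // 2. Self-close the void tags the generator leaves open:
    //    <BR> becomes <BR/>, <HR SIZE="1"> becomes <HR SIZE="1"/>.
    //    Tags that already end in "/>" are left alone.
    return preg_replace('#<(BR|HR)\b([^<>/]*)>#i', '<$1$2/>', $html) ?? $html;
}

echo fixKnownDefects('<TABLE WIDTH=4><TR><TD>a<BR>b</TD></TR></TABLE><HR SIZE=1>'), "\n";
```

A pre-pass like this is only reasonable because the input always comes from the same engine; arbitrary HTML needs a real parser.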

score 4 · Answer 2 · Gordon · 2010-02-28T16:21:22.127

There is SimpleHTML.

For repairing broken HTML, you could use Tidy.

As an alternative, you can use the native XMLReader. Because it acts as a cursor moving forward over the document stream and stopping at each node on the way, it does not have to load and validate the whole document up front, though it still reports an error once it reaches malformed markup.

See http://www.ibm.com/developerworks/library/x-pullparsingphp.html
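
To make the Tidy suggestion concrete, here is a minimal sketch that repairs the markup into XHTML and then hands it to the same SAX-style parser the question uses. It assumes the optional tidy extension is installed. The helper names repairToXhtml and countStartTags and the sample input are invented for illustration.

```php
<?php
/**
 * Repair HTML with Tidy and return well-formed XHTML, or null when the
 * tidy extension is unavailable or the repair fails.
 * repairToXhtml() is an illustrative helper, not part of the extension.
 */
function repairToXhtml(string $html): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy is an optional extension
    }
    $config = [
        'output-xhtml'     => true,   // closes <br>/<hr> and quotes attributes
        'numeric-entities' => true,   // e.g. &nbsp; is written as &#160;
        'doctype'          => 'omit', // keep the output free of a DTD reference
    ];
    $result = tidy_repair_string($html, $config, 'utf8');
    return $result === false ? null : $result;
}

/** Count start tags per element name with the SAX-style xml_* functions. */
function countStartTags(string $xml): array
{
    $counts = [];
    $parser = xml_parser_create('UTF-8');
    // Keep element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        function ($p, string $name, array $attrs) use (&$counts): void {
            $counts[$name] = ($counts[$name] ?? 0) + 1;
        },
        function ($p, string $name): void {
            // End tags are not needed for this example.
        }
    );
    if (xml_parse($parser, $xml, true) !== 1) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
    return $counts;
}

$xhtml = repairToXhtml('<TABLE WIDTH=4><TR><TD>a<BR>b</TD></TR></TABLE>');
if ($xhtml === null) {
    echo "tidy extension not available\n";
} else {
    print_r(countStartTags($xhtml));
}
```

The appeal of this route is that existing xml_parser_create handlers can stay as they are; only the input string changes.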

edited Feb 28 '10 at 16:21

answered Feb 28 '10 at 15:40

Gordon


+1 for Tidy. I find it's more robust at its job than SimpleHTML. Two separate tools for two different jobs, really. – HappyTimeGopher Jun 12 '12 at 14:12

score 1 · Answer 3 · answered Feb 28 '10 at 16:27

Any particular reason you're still using the PHP 4 XML API?

If you can get away with using PHP 5's XML API, there are two possibilities.

First, try the built-in HTML parser. It's really not very good (it tends to choke on poorly formatted HTML), but it might do the trick. Have a look at DOMDocument::loadHTML.

Second option - you could try the HTML parser based on the HTML5 parser specification:

http://code.google.com/p/html5lib/

This tends to work better than the built-in PHP HTML parser. It loads the HTML into a DOMDocument object.

I'd rather not use a DOM parser, as the document is quite big. (And I've already written tons of code for the SAX parser.) – Thomas Ahle Mar 04 '10 at 23:26
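
If the existing SAX handlers are the main thing worth preserving, one possible bridge is to let DOMDocument do the lenient parsing and then serialize the result as XML for the SAX parser. This is only a sketch: the helper name htmlToWellFormedXml and the sample input are assumptions, and the approach still builds the full tree in memory, so it does not address the size concern.

```php
<?php
/**
 * Turn lenient HTML into a well-formed XML string via DOMDocument.
 * htmlToWellFormedXml() is an illustrative helper, not a built-in.
 */
function htmlToWellFormedXml(string $html): string
{
    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Serializing the root element as XML closes <br>, quotes attributes,
    // and leaves out the doctype.
    return $doc->saveXML($doc->documentElement);
}

$xml = htmlToWellFormedXml('<TABLE WIDTH=4><TR><TD>a<BR>b</TD></TR></TABLE>');

// The repaired string can now go through the existing SAX-style parser.
$parser = xml_parser_create('UTF-8');
$ok = xml_parse($parser, $xml, true);
echo $ok === 1 ? "well-formed\n" : "still broken\n";
```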

score 0 · Answer 4 · answered Jan 11 '17 at 10:34

A solution is to use DOMDocument.

Example:

$str = "
<html>
 <head>
  <title>test</title>
 </head>
 <body>
  </div>error.
  <p>another error</i>
 </body>
</html>
";

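$doc = new DOMDocument();
// Collect libxml parse warnings internally instead of silencing them with @
libxml_use_internal_errors(true);
$doc->loadHTML($str);
libxml_clear_errors();
// Print the repaired markup
echo $doc->saveHTML();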

Advantage: the DOM extension is enabled by default in PHP, unlike the Tidy extension.
