16

Is there a better approach to parsing invalid HTML than running Tidy on it?

Side note: there are situations where Tidy isn't available. As I understand it, regular expressions are also not recommended for parsing HTML.

johnlemon
  • 2
    In situations where you don't have Tidy available, you should install it. Or you could just not use broken HTML in the first place. – Matti Virkkunen Aug 31 '10 at 07:17
  • 3
    Are you serious? At least a couple of times I was unable to follow this best practice: invalid HTML from clients that needed to be parsed, or shared hosting with no option to install Tidy. – johnlemon Aug 31 '10 at 07:20
  • possible duplicate of [Best methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html) – Gordon Aug 31 '10 at 07:25

2 Answers

7

I would try something like this: http://php.net/manual/en/domdocument.loadhtml.php

From that page:

The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load. This function may also be called statically to load and create a DOMDocument object.
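
As a rough illustration of the above, here is a minimal sketch of loading malformed markup with `DOMDocument::loadHTML`. The `$brokenHtml` string is a made-up example, not taken from the question. The `libxml_use_internal_errors()` call is my own addition: it keeps libxml's parse complaints out of the PHP warning output so you can inspect or discard them yourself.

```php
<?php
// Hypothetical broken input: <li> elements never closed, unclosed <p> and <b>,
// and no <html>/<body> wrapper.
$brokenHtml = '<ul><li>First<li>Second</ul><p>Unclosed <b>paragraph';

// Buffer libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($brokenHtml);

// Fetch whatever the parser complained about, then clear the buffer
// and restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The recovered tree can be queried like any other DOM document.
$items = [];
foreach ($doc->getElementsByTagName('li') as $li) {
    $items[] = trim($li->textContent);
}

print_r($items);
```

How a given piece of broken markup gets repaired is up to libxml's HTML parser, so it is worth checking the resulting tree (for example with `$doc->saveHTML()`) before relying on its structure.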

Rob
  • It seems loadHTML objects when two or more elements share the same ID value (although this is probably coming from libxml). – HorusKol Jan 06 '15 at 04:22
1

SimpleHTMLDOM is known to be more lenient than PHP's native DOM functions.

Pekka
  • 2
    Suggested third party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org). – Gordon Aug 31 '10 at 07:24
  • @Gordon this time you were too quick :) He is looking to parse broken HTML. – Pekka Aug 31 '10 at 07:25
  • 1
    which all DOM based parsers should be able to handle fine when using [libxml's HTML parser module](http://xmlsoft.org/html/libxml-HTMLparser.html). – Gordon Aug 31 '10 at 07:26