2

I tried several methods to find out which part of an HTML string is invalid:

$dom->loadHTML($badHtml);
$tidy->cleanRepair();
simplexml_load_string($badHtml);

None of them clearly reports which part of the HTML is invalid. Maybe an extra config option for one of them can fix that. Any ideas?

I need this to manually fix HTML input from users. I don't want to rely on automated processes.

johnlemon

2 Answers

3

I'd try loading the offending HTML into a DOMDocument (as you are already doing) and then using SimpleXML to fix things. You should then be able to run a quick diff between the original and the repaired markup to see where the errors are.

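// Silence the warnings loadHTML() emits for malformed markup.
error_reporting(0);

$badHTML = '<p>Some <em><strong>badly</em> nested</stong> tags</p>';

$doc = new DOMDocument();
$doc->encoding = 'UTF-8';

// The parser repairs the markup while building the DOM tree.
$doc->loadHTML($badHTML);

// Serialize the repaired tree so it can be diffed against $badHTML.
$goodHTML = simplexml_import_dom($doc)->asXML();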
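
If the goal is to see where the parser complained rather than only the repaired result, one option is to have libxml collect its errors instead of silencing them. The following is a minimal sketch, assuming the libxml and DOM extensions are enabled. The exact messages and positions depend on the installed libxml2 version, so no output is shown here.

```php
<?php
// Collect libxml parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$badHTML = '<p>Some <em><strong>badly</em> nested</stong> tags</p>';

$doc = new DOMDocument();
$doc->loadHTML($badHTML);

// Each entry is a LibXMLError with the position and a description of the problem.
$errors = libxml_get_errors();
libxml_clear_errors();

foreach ($errors as $error) {
    printf(
        "line %d, column %d: %s\n",
        $error->line,
        $error->column,
        trim($error->message)
    );
}
```

For a single-line string the line number is always 1, so the column is the useful part. Splitting user input across several lines before parsing makes the reported positions easier to read.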
Nev Stokes
1

You can compare the cleaned and the bad versions with PHP Inline-Diff, found in an answer to that Stack Overflow question.
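
As a rough illustration of the comparison idea (this is not PHP Inline-Diff itself), the sketch below splits both versions into tag and text tokens and reports the first position where they diverge. The helper names `tokenizeMarkup` and `firstDifference` are hypothetical, and the `$clean` string is a hand-written stand-in for whatever the cleaning step returns.

```php
<?php
// Split markup into tag and text tokens, e.g. "<p>", "Some ", "<em>".
function tokenizeMarkup(string $html): array
{
    return preg_split(
        '/(<[^>]+>)/',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

// Return the index of the first token where the two lists differ,
// or null if they are identical.
function firstDifference(array $a, array $b): ?int
{
    $max = max(count($a), count($b));
    for ($i = 0; $i < $max; $i++) {
        if (($a[$i] ?? null) !== ($b[$i] ?? null)) {
            return $i;
        }
    }
    return null;
}

// Illustrative input and a hypothetical cleaned version of it.
$bad   = '<p>Some <em><strong>badly</em> nested</strong> tags</p>';
$clean = '<p>Some <em><strong>badly</strong></em> nested tags</p>';

$badTokens   = tokenizeMarkup($bad);
$cleanTokens = tokenizeMarkup($clean);
$i = firstDifference($badTokens, $cleanTokens);

if ($i !== null) {
    printf(
        "first difference at token %d: %s vs %s\n",
        $i,
        $badTokens[$i] ?? '(end)',
        $cleanTokens[$i] ?? '(end)'
    );
}
```

This only points at the first divergence. A real diff library would also align the remaining tokens, which matters when the input has more than one problem.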

jcubic