3

Today, when I parsed a page with Simple HTML DOM Parser, I got no results at all. That seemed strange, so I looked at the page's HTML source and found that it contains many mistakes.

So here is the question: what should I do when the parser works correctly but the HTML is a mess? Maybe someone can suggest an approach, or another parser that can handle this kind of markup.

Thank you all for help.

Eugene
  • possible duplicate of [How do I parse partial HTML?](http://stackoverflow.com/questions/1933631/how-do-i-parse-partial-html) – Pekka Apr 06 '11 at 09:19
  • possible duplicate of [Best methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html) – Gordon Apr 06 '11 at 09:22
  • both above are incorrect. it is not partial HTML, but broken HTML and he is already using the "best option" from that second link. duplicate would be something like http://stackoverflow.com/questions/2351526/parsing-of-badly-formated-html-in-php – dogmatic69 Apr 06 '11 at 09:35
  • It also heavily depends on why you are parsing this HTML and whether you have control over the source – the answer might be tidy, simpledom, even regexp might be the right tool in a few cases. – Adam Kiss Apr 06 '11 at 09:38
  • @dogmatic The answer I linked is specifically about parsing HTML (which implies broken because HTML is broken by design). OP asked for alternatives. DOM can parse broken HTML fine. And SimpleHTMLDom is the worst solution for an HTML parser ever. So the options given in my answer should solve the OP's question, hence, it's a duplicate. – Gordon Apr 06 '11 at 09:53
  • @Gordon which tool from the third-party library list would you suggest instead of DOM? It is also important that the HTML on the page I'm trying to parse is incorrect. – Eugene Apr 06 '11 at 11:40
  • @Eugene see my comments below http://stackoverflow.com/questions/3577641/best-methods-to-parse-html. It depends on your needs. I'm perfectly fine with DOM (!= SimpleHtmlDom). A lot of people like phpQuery because they know jQuery. – Gordon Apr 06 '11 at 11:46
  • @Gordon what would you say about Zend_Dom? I heard that it uses native PHP functionality, just in a more comfortable way. And a question about DOM: if there's an open tag without a closing one, the SimpleHTMLDom parser can't find the right part. For example:
    or without or in other words broken. How would DOM work with that?
    – Eugene Apr 06 '11 at 17:20
  • @Eugene I never used it. It has CSS selectors if you need that. When you try to read broken HTML with DOM's loadHTML, it will switch to the HTML parser module and that would try to fix the HTML for you. – Gordon Apr 06 '11 at 17:24
  • @Gordon. Oh that's great. Then I will definitely try that. Thank you very much. – Eugene Apr 06 '11 at 17:25
  • @Gordon. Nah. In my case it didn't help. :( – Eugene Apr 06 '11 at 18:40
  • @Eugene you might be doing something wrong then. I've yet to come across broken HTML DOM cannot parse. For instance ` foo ` will get transformed to ` foo ` – Gordon Apr 06 '11 at 19:21

2 Answers

2

Run it through tidy before trying to load it into a DOM tree: http://php.net/manual/en/book.tidy.php
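
A minimal sketch of that idea, assuming the optional tidy extension may or may not be installed. The sample markup and the variable names are made up for illustration; `tidy_repair_string()` and `DOMDocument::loadHTML()` are the standard PHP functions. If tidy is missing, the sketch simply passes the raw markup on to DOM.

```php
<?php
// Hypothetical broken markup, for illustration only:
// unclosed <li> elements and an unclosed <p>.
$html = '<ul><li>One<li>Two</ul><p>Unclosed paragraph';

$clean = false;
if (extension_loaded('tidy')) {
    // Ask tidy to repair the markup and emit well-formed XHTML.
    $clean = tidy_repair_string($html, ['output-xhtml' => true], 'utf8');
}
if ($clean === false) {
    // Tidy is unavailable or failed: fall back to the raw markup.
    $clean = $html;
}

// Collect libxml warnings quietly instead of printing them.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($clean);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Read the list items back out of the DOM tree.
$items = [];
foreach ($doc->getElementsByTagName('li') as $li) {
    $items[] = trim($li->textContent);
}
print_r($items);
```

Tidy has many more configuration options than the single one used here; which ones help depends on how the page is broken.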

David Gillen
0

PHP's built-in DOMDocument::loadHTML seems like it should work fine for HTML that is not well written. Have a read through the user comments on the manual page, as some people share information about it.

http://docs.php.net/manual/en/domdocument.loadhtml.php
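
A rough sketch of what that can look like, assuming a made-up snippet with unclosed tags (the markup and variable names are illustrative, not taken from the page in the question). `libxml_use_internal_errors()` keeps the parser's warnings about the bad markup from being printed.

```php
<?php
// Hypothetical broken markup: unclosed <p> and <b>,
// and no <html>/<body> wrapper.
$html = '<div><p>First paragraph<p>Second <b>bold text</div>';

// Buffer libxml's parse warnings instead of emitting them.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);   // the HTML parser tries to repair the tree

libxml_clear_errors();
libxml_use_internal_errors($previous);

// Query the repaired tree with XPath.
$xpath = new DOMXPath($doc);
$paragraphs = [];
foreach ($xpath->query('//p') as $p) {
    $paragraphs[] = trim($p->textContent);
}
print_r($paragraphs);
```

How the parser repairs a given page depends on the markup, so it is worth dumping `$doc->saveHTML()` to see the tree it actually built before writing queries against it.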

dogmatic69