I want to manipulate HTML and XHTML documents with the PHP DOM implementation. I use the DOMDocument->loadHTML() method to load the content.

I want to know whether the loaded content is XHTML or HTML. DOMDocument has a doctype object that holds the DOCTYPE declaration from the document itself. So far I have thought about comparing $dom->doctype->publicId, which contains strings like "-//W3C//DTD HTML 4.01//EN".

Is there any better way anyone can think of?

Edit:

Sorry if my question was a bit unclear. I updated the question since it might have been confusing. But to make it clear now: This question is not about handling HTML with PHP DOM in general or whether XHTML is good or bad.

BoltClock
Alex Lawrence
  • Why not just fix the source documents rather than incur extra server overhead? – Demian Brecht Jan 05 '11 at 23:16
  • What do you mean by fix? I never said they are broken. The source documents are everything provided by a user. So there could be a valid DOCTYPE declaration. It could also be missing. I am actually just curious if anyone knows another or a better way to say if it is XHTML or HTML than using the DOMDocument->doctype. – Alex Lawrence Jan 05 '11 at 23:21
  • pretty sure if you load as html, you should save as html. it should maintain the original document type declaration. you can use the DOM validate method to determine if the document is valid based on its document type declaration. you should have the user fix the code if it is invalid. – dqhendricks Jan 06 '11 at 01:52
  • dqhendricks, your comment is not helpful at all. "pretty sure if you load as html, you should save as html" might sound correct in general, but not in the case of PHP DOM. If you want to deal with invalid markup, you have to use the loadHTML() method. I wasn't even asking about validation. And whether the user has to fix their code if it's invalid is completely out of scope. That decision is a business requirement, not a technical one. – Alex Lawrence Jan 06 '11 at 09:37

1 Answer

If you're loading the document from an external source, you can check the MIME type it is served with. If that is application/xhtml+xml, the content is meant to be XHTML (of course, a server can still send that type along with horribly malformed markup). If it is text/html, it will be parsed as HTML tag soup. Validity of the actual markup aside, the doctype declaration is your next best way of telling whether the content is (or claims to be) HTML or XHTML.
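
As a rough illustration of the MIME type check, here is a minimal sketch that classifies a raw Content-Type header value. The function name `classifyContentType()` and its return values are made up for this example, and how you obtain the header (cURL, stream wrappers, an upload form) is left out on purpose.

```php
<?php
// Hypothetical helper, not part of PHP DOM: classify a Content-Type header value.
function classifyContentType(string $contentType): string
{
    // Drop parameters such as "; charset=utf-8" and normalize the case.
    $mime = strtolower(trim(explode(';', $contentType, 2)[0]));

    if ($mime === 'application/xhtml+xml') {
        return 'xhtml';
    }
    if ($mime === 'text/html') {
        return 'html';
    }

    // Anything else tells us nothing; fall back to inspecting the doctype.
    return 'unknown';
}

$kind = classifyContentType('application/xhtml+xml; charset=utf-8');
```

An 'unknown' result is the case where the doctype becomes the only remaining hint.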

Like you say, you can check the public identifier and/or the URI and determine the type from there.
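
A minimal sketch of that doctype check, assuming that a case-insensitive search for "xhtml" in the public and system identifiers is a good enough signal. The helper name `detectMarkupFlavor()` is hypothetical, and treating everything else as HTML is an assumption rather than anything PHP DOM decides for you.

```php
<?php
// Hypothetical helper: guess the markup flavor from the DOCTYPE declaration.
function detectMarkupFlavor(DOMDocument $dom): string
{
    $doctype = $dom->doctype;
    if ($doctype === null) {
        // No doctype node at all: assume HTML (an assumption of this sketch).
        return 'html';
    }

    // Look at both the public identifier and the system identifier (DTD URI).
    $ids = $doctype->publicId . ' ' . $doctype->systemId;

    return stripos($ids, 'xhtml') !== false ? 'xhtml' : 'html';
}

// Collect parser warnings internally instead of emitting them for user-supplied markup.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML(
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    . '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    . '<html><head><title>t</title></head><body><p>Hi</p></body></html>'
);

$flavor = detectMarkupFlavor($dom);
```

Checking the system identifier as well covers documents whose public identifier is empty or unusual but whose DTD URI still points at an XHTML DTD.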

BoltClock
  • Okay. So my test for XHTML is now: `strpos(strtolower($dom->doctype->publicId), 'xhtml') !== false`. If this is not the case then I assume it is HTML. What do you think? – Alex Lawrence Jan 06 '11 at 12:14
  • @Alex: That sounds alright, since browsers most often receive pages as `text/html` anyway, so that's a reasonable assumption. You can use `stripos()` instead of `strpos(strtolower())`. – BoltClock Jan 06 '11 at 12:16