
I am retrieving some HTML strings from my database and I would like to parse these strings into my DOMDocument. The problem is that the DOMDocument gives warnings on special characters.

Warning: DOMDocumentFragment::appendXML() [domdocumentfragment.appendxml]: Entity: line 2: parser error : Entity 'nbsp' not defined in page.php on line 189

I wonder why this happens and how to solve it. These are some code fragments from my page. How can I fix these kinds of warnings?

$doc = new DOMDocument();

// .. create some elements first, like some divs and a h1 ..

while($row = mysql_fetch_array($result))
{
    $messageEl = $doc->createDocumentFragment();
    $messageEl->appendXML($row['message']); // gives its warnings here!

    $otherElement->appendChild($messageEl);
}

echo $doc->saveHTML();

I also found something about validation, but when I apply that, my page won't load anymore. The code I tried for that was something like this.

$implementation = new DOMImplementation();
$dtd = $implementation->createDocumentType('html','-//W3C//DTD XHTML 1.0 Transitional//EN','http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd');

$doc = $implementation->createDocument('','',$dtd);
$doc->validateOnParse = true;
$doc->formatOutput = true;

// in the same while loop, I used the following:
$messageEl = $doc->createDocumentFragment();
$doc->validate(); // which stopped my code, but without any error or warning.
$messageEl->appendXml($row['message']);

Thanks in advance!

Marnix
  • What does `$row['message']` contain, exactly? – Tomalak Jan 10 '11 at 10:33
  • It contains a piece of HTML, most of the time just a `<p>Stuff here</p>`. But it can always contain more elements as well. – Marnix Jan 10 '11 at 10:35
  • Also, why are you building an XML document in memory just to do `echo $doc->saveHTML();` at the end? This makes no sense. You could just echo the HTML to the page without all the XML voodoo, or couldn't you? – Tomalak Jan 10 '11 at 10:48
  • I would like to do this, because I really like OO programming. Printing the tags manually gives me no structure to my code at all. I want to be sure that some things are printed first and I like to keep the overview of this process. – Marnix Jan 10 '11 at 11:29

5 Answers


There is no &nbsp; in XML. The only character entities that have an actual name defined (instead of using a numeric reference) are &amp;, &lt;, &gt;, &quot; and &apos;.

That means you have to use the numeric equivalent of a non-breaking space, which is &#160; or (in hex) &#xA0;.
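
As a minimal sketch of that substitution: the sample string and the div container below are made up for illustration, standing in for $row['message'] and $otherElement from the question. Replacing the named entity with its numeric reference before calling appendXML() should let the markup be parsed into real elements:

```php
<?php
// Hypothetical sample string standing in for $row['message'].
$message = '<p>Stuff&nbsp;here</p>';

// Swap the HTML-only named entity for its numeric character reference.
$xmlSafe = str_replace('&nbsp;', '&#160;', $message);

$doc = new DOMDocument();

// Hypothetical container standing in for $otherElement.
$container = $doc->createElement('div');
$doc->appendChild($container);

$fragment = $doc->createDocumentFragment();

// appendXML() returns true only when the string is well-formed XML.
$ok = $fragment->appendXML($xmlSafe);
if ($ok) {
    $container->appendChild($fragment);
}

echo $doc->saveHTML();
```

This only covers &nbsp;. Any other HTML-only named entity, or markup that is not well-formed XML, would still make appendXML() fail.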

If you are trying to save HTML into an XML container, then save it as text. HTML and XML may look similar but they are very distinct. appendXML() expects well-formed XML as an argument. Use the nodeValue property instead, it will XML-encode your HTML string without any warnings.

// document fragment is completely unnecessary
$otherElement->nodeValue = $row['message'];
Tomalak
  • So I should first parse every string and map them to some equivalents, so the xml parser can map them back again? Is there a function for this in PHP? – Marnix Jan 10 '11 at 10:32
  • @Marnix: No, of course not. There is no need to modify your input string to make it work with XML, you are just using the wrong function. See edited answer. – Tomalak Jan 10 '11 at 10:40
  • This isn't working. The nodeValue is printing the tags as well. So my output now contains `<p>Stuff here</p>`. The `<p>` is transformed into text, instead of a p-tag. – Marnix Jan 10 '11 at 11:09
  • @Marnix: As I said, XML and HTML are not the same thing. You can't intermix them, unless you use XHTML (and I suppose your database content is not valid XHTML). If you want to do templating (and that's what it seems like) then use a templating engine like Smarty to base your page on, not an XML document. General tip: If it is overly difficult to do something simple, you might be using the wrong tools. – Tomalak Jan 10 '11 at 11:59
  • I accept the comment of smarty, not the actual answer. Smarty does indeed do nice work with printing stuff. The templates make the code more readable, which I was looking for. – Marnix Jan 10 '11 at 15:07
  • @Marnix: My comment on Smarty may solve your actual problem, but my answer is correct for the question you've asked. The thing is, you did not ask the right question. ;-) – Tomalak Jan 10 '11 at 16:02

That's a tricky one because it's actually multiple issues in one.

Like Tomalak points out, there is no &nbsp; in XML. So you did the right thing specifying a DOMImplementation, because in XHTML there is &nbsp;. But for DOM to know that the document is XHTML, you have to load and validate against the DTD. The DTD is located at

http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

but because there are millions of requests to that page daily, the W3C decided to block access to the page unless a User-Agent is sent in the request. To supply a User-Agent you have to create a custom stream context.

In code:

// make sure DOM passes a User Agent when it fetches the DTD
libxml_set_streams_context(
    stream_context_create(
        array(
            'http' => array(
                'user_agent' => 'PHP libxml agent',
            )
        )
    )
);

// specify the implementation
$imp = new DOMImplementation;

// create a DTD (here: for XHTML)
$dtd = $imp->createDocumentType(
    'html',
    '-//W3C//DTD XHTML 1.0 Transitional//EN',
    'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'
);

// then create a DOMDocument with the configured DTD
$dom = $imp->createDocument(NULL, "html", $dtd);
$dom->encoding = 'UTF-8';
$dom->validate();

$fragment = $dom->createDocumentFragment();
$fragment->appendXML('
    <head><title>XHTML test</title></head>
    <body><p>Some text with a &nbsp; entity</p></body>
    '
);
$dom->documentElement->appendChild($fragment);
$dom->formatOutput = TRUE;
echo $dom->saveXml();

This still takes some time to complete (don't ask me why) but in the end, you'll get (reformatted for SO)

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC 
    "-//W3C//DTD XHTML 1.0 Transitional//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
        <title>XHTML test</title>
    </head>
    <body>
        <p>Some text with a &nbsp; entity</p>
    </body>
</html>

Also see DOMDocument::validate() problem

Gordon
  • This requires that the database content is a valid XHTML snippet to begin with, plus it solves a problem that the OP is in no need of having (so to speak). He is trying to re-invent an HTML templating engine on the basis of XML documents, which is an unnecessarily painful approach to a problem that has already been solved otherwise. If I understood it correctly, he wants to use XML because he wants to use XML - a weak reason, IMHO. Anyway, +1 for the effort. – Tomalak Jan 10 '11 at 13:26
  • I don't actually want to use XML, but just use a DOMDocumentFragment.appendHTML(), which doesn't exist. +1 for the code, but I won't use it. The Smarty worked for me! – Marnix Jan 10 '11 at 15:05

I do see the problem in question, and also that the question has been answered, but if I may I'd like to suggest a thought from my past dealing with similar problems.

It just might be so that your task requires including tagged data from the database in the resulting XML, but may or may not require parsing. If it's merely data for inclusion, and not structured parts of your XML, you can place strings from the database in CDATA section(s), effectively bypassing all validation errors at this stage.
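
A rough sketch of the CDATA idea, assuming the stored HTML only needs to be carried inside the XML as opaque text. The sample string and the message element name are hypothetical:

```php
<?php
// Hypothetical sample string standing in for a row from the database.
$message = '<p>Stuff&nbsp;here</p>';

$doc = new DOMDocument('1.0', 'UTF-8');

// Hypothetical wrapper element for the stored markup.
$root = $doc->createElement('message');
$doc->appendChild($root);

// The CDATA section carries the string verbatim, so the XML parser
// never has to resolve &nbsp; or look at the tags inside it.
$root->appendChild($doc->createCDATASection($message));

$xml = $doc->saveXML();
echo $xml;
```

The markup inside the section is plain character data to the DOM, not a tree of elements, and the string must not itself contain the sequence ]]>, which would end the section early.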

Dennis Kreminsky

Here's another approach, because we did not want possibly slow network requests (or any network requests at all resulting from user input):

<?php
$document = new \DOMDocument();
$document->loadHTML('<html><body></body></html>');

$html = '<b>test&nbsp;</b>';
$fragment = $document->createDocumentFragment();

$html = '<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document [
<!ENTITY nbsp   "&#160;" >
]>
<document>'.$html.'</document>';

$newdom = new \DOMDocument();
$newdom->loadXML($html, LIBXML_HTML_NOIMPLIED | LIBXML_NOCDATA | LIBXML_NOENT | LIBXML_NONET | LIBXML_NOBLANKS);

foreach ($newdom->documentElement->childNodes as $childnode)
  $fragment->appendChild($fragment->ownerDocument->importNode($childnode, TRUE));

$document->getElementsByTagName('body')[0]->appendChild($fragment);

echo $document->saveHTML();

Here we include the relevant part of the DTD, specifically the latin1 entity definitions as an internal DOCTYPE definition. Then the HTML content is wrapped in a document element to be able to process a sequence of child elements. The parsed nodes are then imported and added into the target DOM.

Our actual implementation uses file_get_contents to load the DTD containing all entity definitions from a local file.
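
For illustration only, here is a sketch of the same idea with more than one entity, building the internal subset from an array instead of a local DTD file. The entity map and the sample string are assumptions, not the actual implementation described above:

```php
<?php
// Hypothetical entity map; a real setup would cover the full HTML entity list.
$entities = [
    'nbsp'   => '&#160;',
    'copy'   => '&#169;',
    'eacute' => '&#233;',
];

// Build one <!ENTITY ...> declaration per map entry.
$subset = '';
foreach ($entities as $name => $value) {
    $subset .= sprintf("<!ENTITY %s \"%s\">\n", $name, $value);
}

// Hypothetical HTML snippet using the named entities.
$html = '<b>caf&eacute;&nbsp;&copy;</b>';

$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
     . "<!DOCTYPE document [\n" . $subset . "]>\n"
     . '<document>' . $html . '</document>';

$parsed = new DOMDocument();

// LIBXML_NOENT substitutes the entities; LIBXML_NONET forbids network access.
$loaded = $parsed->loadXML($xml, LIBXML_NOENT | LIBXML_NONET);

// Text content of the wrapper, with the entities resolved to characters.
$text = $parsed->documentElement->textContent;
```

The child nodes of the document element can then be imported into the target document with importNode(), as in the snippet above.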

Ivo Smits

While Smarty might be a good bet (why reinvent the wheel for the 14th time?), etranger might have a point. There are situations in which you don't want to use something as heavy as a complete new (and unstudied) package, but simply want to output some data from a database that just happens to contain HTML an XML parser has issues with.

Warning: the following is a simple solution, but don't do it unless you're SURE you can get away with it! (I did this when I had about 2 hours before a deadline and didn't have time to study, let alone implement, something like Smarty...)

Before passing the string to appendXML(), run it through preg_replace. For instance, replace every &nbsp; entity with [some_prefix]_nbsp. Then, on the page where you show the HTML, do it the other way around.

And Presto! =)

Example code: Code that puts text into a document fragment:

// add text tag to p tag.
// print("CCMSSelTextBody::getDOMObject: strText: ".$this->m_strText."<br>\n");
$this->m_strText = preg_replace("/&nbsp;/", "__nbsp__", $this->m_strText);
$domTextFragment = $domDoc->createDocumentFragment();
$domTextFragment->appendXML(utf8_encode($this->m_strText));
$p->appendChild($domTextFragment);
// $p->appendChild(new DOMText(utf8_encode($this->m_strText)));

Code that parses the string and writes the HTML:

// Instantiate template.
$pTemplate = new CTemplate($env, $pageID, $pUser, $strState);

// Parse tag-sets.
$pTemplate->parseTXTTags();
$pTemplate->parseCMSTags();

// present the html code.
$html = $pTemplate->getPageHTML();
$html = preg_replace("/__nbsp__/", "&nbsp;", $html);
print($html);

It's probably a good idea to think up a stronger replacement token. If you insist on being thorough, take the md5 of a time() value and hardcode the result as a prefix. So, in the first snippet:

$this->m_strText = preg_replace("/&nbsp;/", "4597ee308cd90d78aa4655e76bf46ee0_nbsp", $this->m_strText);

And in the second:

$html = preg_replace("/4597ee308cd90d78aa4655e76bf46ee0_nbsp/", "&nbsp;", $html);

Do the same for any other tags and stuff you need to circumvent.

This is a hack, and not good code by any stretch of the imagination. But it saved my life, and I wanted to share it with other people who run into this particular problem with minutes to spare.

Use the above at your own risk.