I am using PHP to parse the XML
response of an API. Here is a sample response:
$xml = '<?xml version="1.0"?>
<q:response xmlns:q="http://api-url">
<q:impression>
<q:content>
<html>
<meta name="HandheldFriendly" content="True">
<meta name="viewport" content="width=device-width, user-scalable=no">
<meta http-equiv="cleartype" content="on">
</head>
<body style="margin:0px;padding:0px;">
<iframe scrolling="no" src="http://api-response-url/with/lots?of=parameters&somethingmore=someval" width="320px" height="50px" style="border:none;"></iframe>
</body>
</html>
</q:content>
<q:cpc>0.02</q:cpc>
</q:impression>
</q:response>';
Note the following points:
- The response has some invalid markup:
  - The <head> start tag inside <html> is missing, but the tag is closed.
  - The <meta> tags inside <html> are not closed.
- The iframe's src attribute contains a URL with multiple parameters separated by &. So this and any other URLs need to be encoded before calling $dom->loadXML(); (see my code below).
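To make the failure concrete, here is a minimal sketch that loads a shortened stand-in for the response and collects the libxml errors instead of letting them surface as warnings. The shortened $xml and the example.invalid URL are placeholders of mine, not the real payload; the exact error messages depend on the libxml version, so none are shown.

```php
<?php
// Shortened stand-in for the response above (assumed, not the real payload):
// <meta> is never closed, </head> has no start tag, and the URL has a bare &.
$xml = '<?xml version="1.0"?>
<q:response xmlns:q="http://api-url">
<q:content><html><meta name="viewport" content="x"></head><body>
<iframe src="http://example.invalid/lots?of=parameters&somethingmore=someval"></iframe>
</body></html></q:content>
</q:response>';

// Collect parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$loaded = $dom->loadXML($xml); // false here: the document is not well-formed XML

// Turn the collected LibXMLError objects into readable lines.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}
libxml_clear_errors();

var_dump($loaded);
print_r($messages);
```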
Requirement
- I need to read whatever is inside the <q:content></q:content> tags.
- I need to parse invalid markup (such as I am getting) and read the content properly.
- URLs need to be encoded for the characters listed in What characters do I need to escape in XML documents?. This needs to be done within the current logic I am following.
Current code
So far I have this code, which works fine if the content inside the <q:content></q:content>
tags is valid markup:
$dom = new DOMDocument;
$dom->loadXML($xml); // load the XML string defined above - works only if the entire XML is valid
$adHtml = "";
foreach ($dom->getElementsByTagNameNS('http://api-url', '*') as $element)
{
    if ($element->localName == "content")
    {
        $children = $element->childNodes;
        foreach ($children as $child)
        {
            // serialize each child node of <q:content> back to markup
            $adHtml .= $child->ownerDocument->saveXML($child);
        }
    }
}
echo $adHtml; // the necessary contents are now in $adHtml
Check working code here (with valid markup and single param in iframe src).
What I am thinking now
Now, going with the solution given by @hakre in my previous question:
- I tried DOMDocument::loadHTML() and it fails, as I expected. It gives warnings like: Warning: DOMDocument::loadHTML(): Tag q:response invalid in Entity, line: 2
- The other suggestion is to escape a specific part of the string for the characters listed in What characters do I need to escape in XML documents?.
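A minimal sketch of what "escape a specific part of the string" could look like, as far as I understand it. The helper name escapeContentPart, the shortened $xml and the example.invalid URL are my own placeholders, not part of the API or of the referenced answer. It also assumes the element is always written exactly as <q:content> (same prefix, no attributes).

```php
<?php
// Hypothetical helper: escape everything between <q:content> and
// </q:content> so that the surrounding document becomes well-formed XML.
function escapeContentPart(string $xml): string
{
    return preg_replace_callback(
        '~(<q:content>)(.*?)(</q:content>)~s',
        function (array $m): string {
            // htmlspecialchars() converts &, <, > and both quote types to entities.
            return $m[1] . htmlspecialchars($m[2], ENT_QUOTES | ENT_XML1, 'UTF-8') . $m[3];
        },
        $xml
    );
}

// Shortened stand-in for the API response, with the same kinds of problems.
$xml = '<?xml version="1.0"?>
<q:response xmlns:q="http://api-url">
<q:impression>
<q:content><html><meta name="a" content="b"></head><body><iframe src="http://example.invalid/x?of=parameters&somethingmore=someval"></iframe></body></html></q:content>
<q:cpc>0.02</q:cpc>
</q:impression>
</q:response>';

$dom = new DOMDocument;
$loaded = $dom->loadXML(escapeContentPart($xml));

// The content is now a single text node; textContent decodes the entities
// again and so gives back the original (still invalid) HTML string.
$content = $dom->getElementsByTagNameNS('http://api-url', 'content')->item(0);
$adHtml = $content->textContent;

echo $adHtml;
```

With this, the rest of the response (for example q:cpc) stays readable through the DOM, while the content comes back as a plain string.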
Question
Finally, if I have to "escape a specific part of the string" (in my case, whatever is between the <q:content></q:content>
tags), as given in that answer, in order to encode it, then why shouldn't I look for those delimiters (<q:content></q:content>
) in the first place and return what is between them? What, then, is the benefit of using DOMDocument::loadXML()
in such cases? I guess this is a pretty common case...
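For comparison, the delimiter-based alternative I have in mind could be sketched like this. The helper name extractContent, the shortened $xml and the example.invalid URL are placeholders of mine. Handing the extracted part to DOMDocument::loadHTML() afterwards is only an assumption about how the HTML might be consumed.

```php
<?php
// Hypothetical helper: returns the raw text between the first <q:content>
// and the next </q:content>, or null when either delimiter is missing.
function extractContent(string $xml): ?string
{
    $open  = '<q:content>';
    $close = '</q:content>';

    $start = strpos($xml, $open);
    if ($start === false) {
        return null;
    }
    $start += strlen($open);

    $end = strpos($xml, $close, $start);
    if ($end === false) {
        return null;
    }
    return substr($xml, $start, $end - $start);
}

// Shortened stand-in for the API response, with the same kinds of problems.
$xml = '<?xml version="1.0"?>
<q:response xmlns:q="http://api-url">
<q:content><html><meta name="a" content="b"></head><body><iframe src="http://example.invalid/x?of=parameters&somethingmore=someval"></iframe></body></html></q:content>
<q:cpc>0.02</q:cpc>
</q:response>';

$adHtml = extractContent($xml);

// The extracted part is HTML rather than XML, so the lenient HTML parser
// can take over; libxml errors are collected instead of printed.
libxml_use_internal_errors(true);
$html = new DOMDocument;
$html->loadHTML($adHtml);
libxml_clear_errors();

$iframe = $html->getElementsByTagName('iframe')->item(0);
$src = $iframe !== null ? $iframe->getAttribute('src') : null;

echo $adHtml, PHP_EOL;
var_dump($src);
```

The obvious weakness is that the plain string search knows nothing about namespaces: it breaks as soon as the API uses a different prefix or adds an attribute to the element.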
So, my question is: given this requirement and the points listed under "Note the following points", what is the most clever way to proceed?