
My brother and I are trying to parse XML content from a website using cURL and DOM.

I keep on getting a blank return value when I try to echo various aspect of the dom parts.

Here are some details:

  1. An example URL that we fetch with cURL and parse with DOM looks like this: https://event.on24.com/eventRegistration/EventServlet?eventid=2062141&sessionid=1&key=FD3181776AA1D3051A0CE6249F1A391A&filter=eventsessionmediapresentationlogplayerxmlformateventrootmediabaseurldialininfomobileenvondemandexcludequestionexcludemessagesexcludeslides
  2. Notice the URL is not the direct path to an XML file. But on that page it has XML content. Try to click on the link, you'll see what I mean.
  3. I am wanting to print the content between the tags.
  4. Either the way I am using cURL and DOM is not right, or something else is wrong.

I've tried echo statements in different places in my code, but all of them print nothing. When I echo $parsedcontent, the output is blank.

When I echo "Hello World" right after the "foreach (... 'span') as ..." line, nothing is printed either.

$urlcontent = $event['url']; 
$chcontent = curl_init();
$timeoutcontent = 5;
curl_setopt($chcontent, CURLOPT_URL, $urlcontent);
curl_setopt($chcontent, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($chcontent, CURLOPT_CONNECTTIMEOUT, $timeoutcontent);
curl_setopt($chcontent, CURLOPT_SSL_VERIFYPEER, false);
$htmlcontent = curl_exec($chcontent);
$infocontent = curl_getinfo($chcontent);
curl_close($chcontent);

@$domcontent->loadXML($htmlcontent);

foreach($domcontent->getElementsByTagName('span') as $spanon24content) {
    # Get url and title from <a> tags
    $innerHTMLspan = ''; 
    $childrenspan  = $spanon24content->childNodes;

    foreach ($childrenspan as $childspan) { 
        $innerHTMLspan .= $divspanon24content->ownerDocument->saveXML($childspan);
    }
}
$parsedcontent = $innerHTMLspan;

echo $parsedcontent;
cOle2
  • I think the answer on this question might point you in the right direction: https://stackoverflow.com/questions/6674322/how-to-get-values-inside-cdatavalues-using-php-dom – cOle2 Aug 07 '19 at 21:26
  • Possible duplicate of [How do you parse and process HTML/XML in PHP?](https://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) – hanshenrik Aug 07 '19 at 23:55
  • `I keep on getting a blank return value when I try to echo various aspect of the dom parts.` - when debugging, use var_dump() rather than echo to avoid this issue. Also make sure that php.ini has `error_reporting=E_ALL` and `display_errors=On` (or alternatively, make sure the error log works, and read the error log after running your code) – hanshenrik Aug 07 '19 at 23:57
  • `Try to click on the link, you'll see what I mean.` what do you mean **the link** ? your test XML page has 80 different links! which of the 80 links do you mean? – hanshenrik Aug 08 '19 at 07:54
  • `I am wanting to print the content between the tags.` which tags are you talking about, it has 3679 tags, do you want the content between *all* of them? – hanshenrik Aug 08 '19 at 08:00

1 Answer


The span is inside an HTML fragment that is stored as text (a CDATA section) in the outer XML. To the XML parser this is just text, so getElementsByTagName('span') on the XML document finds nothing. You need to load (and parse) that text into a separate DOM document.

$xml = <<<'XML'
<events>
  <eventkey>valid</eventkey>
  <nowdate>1565257004221</nowdate>
  <event>
    <eventAbstract><![CDATA[<p><span style="font-size:16px;">Scaling automation in your security environment can involve unnecessary time to clean up task completion notes as more incidents fly in.</span></p>

<p><span style="font-size:16px;">Join Gerald Trotman, CTP for IBM Resilient, in this tech session to learn how Resilient Task Helper Functions can help clean and consolidate notes to improve visibility into completed tasks and ultimately cut down the&nbsp;time to respond for your security team.</span></p>]]>
    </eventAbstract>
  </event>
</events>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

foreach ($xpath->evaluate('//eventAbstract') as $abstractNode) {
    // load the node content as HTML
    $htmlDocument = new DOMDocument();
    $htmlDocument->loadHTML($abstractNode->textContent);
    $htmlXpath = new DOMXPath($htmlDocument);

    // just read text content
    $innerText = $htmlDocument->textContent;

    // build up a (x)html fragment
    $innerHTML = '';
    foreach ($htmlXpath->evaluate('//span/node()') as $spanChildNode) {
        $innerHTML .= $htmlDocument->saveXML($spanChildNode);
    } 
    var_dump($innerText, $innerHTML);
} 
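
The same two-step parse can be wrapped in a function that reports parser problems instead of hiding them. This is a minimal sketch, not a definitive implementation: the name `extractSpanMarkup` and the sample XML are hypothetical, and it assumes the embedded HTML lives in `eventAbstract` as in the example above. It creates the `DOMDocument` before calling `loadXML()` and uses `libxml_use_internal_errors()` rather than `@`, so a failed parse raises an exception with the libxml messages.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (an illustration, not part of any library):
 * returns the inner markup of every <span> found in the HTML that is
 * embedded as text inside <eventAbstract> elements.
 */
function extractSpanMarkup(string $xml): array
{
    if ($xml === '') {
        throw new InvalidArgumentException('Empty response, nothing to parse.');
    }

    // Collect libxml errors instead of suppressing them with "@".
    $previous = libxml_use_internal_errors(true);

    try {
        // The document object has to exist before loadXML() is called.
        $document = new DOMDocument();
        if (!$document->loadXML($xml)) {
            $messages = [];
            foreach (libxml_get_errors() as $error) {
                $messages[] = trim($error->message);
            }
            throw new RuntimeException('Invalid XML: ' . implode('; ', $messages));
        }

        $fragments = [];
        $xpath = new DOMXPath($document);

        foreach ($xpath->evaluate('//eventAbstract') as $abstractNode) {
            $html = trim($abstractNode->textContent);
            if ($html === '') {
                continue;
            }

            // Second parse: the CDATA text is HTML, so it gets its own document.
            $htmlDocument = new DOMDocument();
            $htmlDocument->loadHTML($html);
            $htmlXpath = new DOMXPath($htmlDocument);

            foreach ($htmlXpath->evaluate('//span') as $span) {
                $inner = '';
                foreach ($span->childNodes as $child) {
                    $inner .= $htmlDocument->saveHTML($child);
                }
                $fragments[] = $inner;
            }
        }

        return $fragments;
    } finally {
        // Restore the previous libxml error handling state.
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }
}

// Sample input (made up for this sketch).
$xml = <<<'XML'
<events>
  <event>
    <eventAbstract><![CDATA[<p><span>Hello <b>world</b></span></p><p><span>Second</span></p>]]></eventAbstract>
  </event>
</events>
XML;

$spans = extractSpanMarkup($xml);
var_dump($spans);
```

With the code from the question, you would pass `$htmlcontent` to this function after checking that `curl_exec()` did not return `false`; if it did, `curl_error($chcontent)`, called before `curl_close()`, gives the reason.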
ThW