3

Hello, good day. I am trying to scrape an XML feed that was given to us. I am using Simple HTML DOM to parse it, but some elements contain CDATA sections. How can I remove the CDATA markers and keep only the text?

<date>
<weekday>
<![CDATA[ Friday
]]> 
</weekday>
</date>

php

<?php
include('simple_html_dom.php');
include('phpQuery.php');

if (ini_get('allow_url_fopen')) {
    // URL fopen wrappers are enabled: fetch and parse the feed directly
    $xml = file_get_html('http://www.link.com/url.xml');
} else {
    // Otherwise download the feed with cURL and parse the returned string
    $ch = curl_init('http://www.link.com/url.xml');
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $src = curl_exec($ch);
    $xml = str_get_html($src, false);
}
?>
<?php 
foreach($xml->find('weekday') as $e)
echo $e->innertext  . '<br>';
?>

I believe Simple HTML DOM removes the CDATA markers by default, but for some reason that doesn't happen here.

Kindly tell me if you need any other info that would help solve this issue.

Thank you so much for your help.

cooldude

2 Answers

3

You can use another XML parser that is able to convert the CDATA into a string (Demo):

$innerText = '<![CDATA[ Friday
]]>';

$innerText = (string) simplexml_load_string("<x>$innerText</x>");

Extended code example based on OP's code:

# [...]
<?php 
foreach($xml->find('weekday') as $e)
{
    $innerText = $e->innertext;
    $innerText = (string) simplexml_load_string("<x>$innerText</x>");
    echo $innerText . '<br>';
}
?>

Usage instructions: Locate the line which contains the foreach and then compare the original code with the new code (only the foreach in question has been replaced).
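
The conversion can also be tried in isolation. The following is only a minimal sketch: the hard-coded `$innerText` value is a stand-in for whatever `$e->innertext` returns in the question, and the `trim()` call and the fallback for unparsable input are additions that the answer above does not include.

```php
<?php
// Hypothetical stand-in for the value of $e->innertext from the question.
$innerText = "<![CDATA[ Friday\n]]>";

// Wrap the fragment in a dummy root element so SimpleXML can parse it.
$element = simplexml_load_string("<x>$innerText</x>");

// simplexml_load_string() returns false on malformed input;
// in that case fall back to the raw text instead of failing.
$text = ($element === false) ? $innerText : trim((string) $element);

echo $text, "\n"; // prints "Friday"
```

Casting the element to a string yields its text content, including the content of the CDATA section, and `trim()` drops the surrounding space and line break.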

hakre
  • It doesn't seem to work. The day (Friday) is dynamic; the XML is a weather feed. I was able to scrape everything using Simple HTML DOM except the elements with CDATA. Thank you for the info, I'll play around with another XML parser just like you said =) – cooldude Sep 23 '11 at 20:09
  • 1
    Just use that one-liner on your variable: `$e->innertext`. No need to change the complete library if you need a quick fix. Don't forget to report your problem to the library author. – hakre Sep 23 '11 at 20:11
  • What do you mean? Replace `$e->innertext` with `$innerText = '<![CDATA[ Friday ]]>';`? – cooldude Sep 23 '11 at 20:17
  • I cannot imagine that this one-liner example is too hard to grasp. What I can specifically say is: you can use that line of code to convert the CDATA text you get from `$e->innertext` into non-CDATA text. Is the message clear? I will extend the example so it's failsafe. – hakre Sep 24 '11 at 13:12
  • Appreciate your help, hakre. Sorry for the late reply. Thank you so much – cooldude Sep 26 '11 at 12:14
2

I agree with the other answer: let the parser hand you the CDATA content as plain text. I'd recommend SimpleXML:

$xml = simplexml_load_file('test.xml', 'SimpleXMLElement', LIBXML_NOCDATA);
echo '<pre>', print_r($xml, true), '</pre>'; // pass true so print_r() returns the dump instead of printing it and echoing a stray "1"

LIBXML_NOCDATA is important: it tells libxml to merge CDATA sections into ordinary text nodes, so keep it in there.
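
To read a single value rather than dump the whole tree, something like the following might work. It is a sketch under assumptions: the XML is the sample from the question embedded as a string (a real feed would go through `simplexml_load_file()` as above), and the `$source` and `$weekday` names are made up for illustration.

```php
<?php
// Sample document copied from the question, embedded here so the example is self-contained.
$source = <<<XML
<date>
<weekday>
<![CDATA[ Friday
]]>
</weekday>
</date>
XML;

// LIBXML_NOCDATA merges CDATA sections into ordinary text nodes while parsing.
$xml = simplexml_load_string($source, 'SimpleXMLElement', LIBXML_NOCDATA);

if ($xml === false) {
    throw new RuntimeException('Could not parse the XML');
}

// $xml is the root <date> element; child elements are reachable as properties.
// Cast to string and trim the whitespace that surrounded the CDATA section.
$weekday = trim((string) $xml->weekday);

echo $weekday, "\n"; // prints "Friday"
```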

mikevoermans