How to parse text and image from complex xml

Question

I hope you can help me with that. The XML file looks like this:

<channel><item>
<description>
<div>  <a href="http://image.com">
<span>   
<img src="http://image.com" /> 
</span>
</a>
Lorem Ipsum is simply dummy text of the printing etc... 
</div>
</description>
</item></channel>

I can get the contents of the description tag, but when I do that, I get the whole structure, which has lots of CSS in it, and I don't want that. What I really need is to parse only the href link and the Lorem Ipsum text. I'm trying with SimpleXML, but I can't figure it out; it looks too complicated. Any ideas?

Edit: the code I use to parse the XML:

$file = new SimpleXMLElement($mydata);

foreach ($file->channel->item as $post) {
    echo $post->description;
}

I also tried to get the attributes using attributes(), but I can't get that to work. The description tag has no attributes, but it has more tags inside, like div, a and img. I can't get the attributes from the 'a' and 'img' tags with SimpleXML. — pano, Jan 13 '13 at 12:33

score 1 · Answer 1 · edited May 23 '17 at 12:27

That XML looks very much like an RSS or Atom feed (or an extract from one). The description node would commonly be escaped, or placed inside a section marked <![CDATA[ ... ]]>, which indicates that its contents are to be treated as raw text, even if they contain <, >, or &.

Your sample doesn't indicate that, but if your echo is giving you the whole content, including img tags etc, then that is what is happening, and your question is similar to Trying to Parse Only the Images from an RSS Feed - you need to grab the whole description content, and parse it as a document of its own.
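A quick way to check which case applies is to count the child elements of the description. The following is only a minimal sketch: the feed string is made up for illustration, and it assumes the description is wrapped in CDATA.

```php
<?php
// Hypothetical feed extract: the description is wrapped in CDATA,
// so its markup is plain text as far as the XML parser is concerned.
$feed = '<channel><item><description><![CDATA['
      . '<div><a href="http://image.com">link</a> Lorem Ipsum</div>'
      . ']]></description></item></channel>';

$channel = simplexml_load_string($feed);   // <channel> is the root element here
$description = $channel->item->description;

// No child elements: the <div> is text, not a node.
$childCount = count($description->children());

// Casting to string yields the raw markup, ready to be parsed on its own.
$raw = (string)$description;

echo $childCount, "\n";
echo $raw, "\n";
```

If the count is zero and the string contains tags, the content has to be parsed as a separate document.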

If for some reason the HTML is not being escaped, and is actually being included as a bunch of child nodes inside the XML, then the linked URL can be accessed directly (assuming the structure is always consistent):

echo (string)$post->description->div->a['href'];

As for the text, SimpleXML will concatenate all text content of a particular element (but not from within its children) if you "cast to string" with (string) (echo automatically casts to string, but I'm guessing you'll want to do something other than echo with it eventually).

In your example, the text you want is inside the first (and only) div, so this would display it:

echo (string)$post->description->div;

However, you mention "lots of CSS", which I guess you've left out of your example for simplicity, so I'm not sure how consistent your real content is.
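Put together, the unescaped case might look like the sketch below. It assumes the markup really is present as child nodes and always has the same shape as the sample in the question; the XML string is a shortened copy of that sample with `<channel>` as the root element.

```php
<?php
// Sample modelled on the question; in a full RSS feed <channel>
// would sit below <rss> instead of being the root.
$xml = '<channel><item><description><div>'
     . '<a href="http://image.com"><span><img src="http://image.com" /></span></a>'
     . ' Lorem Ipsum is simply dummy text '
     . '</div></description></item></channel>';

$channel = simplexml_load_string($xml);

foreach ($channel->item as $post) {
    $div  = $post->description->div;
    $href = (string)$div->a['href'];            // attribute of the <a> element
    $src  = (string)$div->a->span->img['src'];  // attribute of the nested <img>
    $text = trim((string)$div);                 // only the div's own text, not its children's

    echo $href, "\n", $src, "\n", $text, "\n";
}
```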

Yes, there are many style attributes in there. There are many span tags in the text as well, and some divs too, which I believe might cause some problems. I saw your other post, http://stackoverflow.com/questions/14246656/trying-to-parse-only-the-images-from-an-rss-feed?lq=1 and it seems to work for me, as it gets all the image links. As for the text, it does give me some results (with a little bit of modification), but the problem is that I can't read the results due to wrong encoding or something. I wrote `new DOMDocument('1.0','UTF-8')` but it didn't work. What I get is something like ÎÎµÎ³Î¬Î. — pano, Jan 13 '13 at 22:38
It works fine for images and probably does for the text too. Encoding is still a problem (the text is in Greek). I think it is caused when I try to get the `$description_dom`. I'll post the final code. — pano, Jan 13 '13 at 23:39

Nemo64 · Answer 2 · 2013-01-13T18:58:27.350

That's going to be complicated. ~~You don't have XML there but html. One difference is that a tag can't contain another tag AND some text in XML. That's why~~ I'd use the DOM of PHP (which I haven't used yet but is similar to pure JavaScript).

This is what I have hacked together (untested):

// first create our document
$doc = new DOMDocument('1.0', 'utf-8');
$doc->loadHTML("your html here"); // there is also a loadHTMLFile

// this tries to get an a element which has a href and returns that href
function getAHref ( $doc ) {
    // now get all a elements to get the one with a href
    $aElements = $doc->getElementsByTagName( "a" );
    foreach ( $aElements as $a ) {
        // does this element have an href? then return it
        if ( $a->hasAttribute( "href" ) ) {
            return $a->getAttribute( "href" );
        }
    }
    // failed? return false
    return false;
}

// tries to get the text in the node
// in your example the text isn't wrapped in anything so this is going to be difficult
function getTextFromNode ( $doc ) {
    // get and loop all divs (assuming the text is always a child of a div)
    $divs = $doc->getElementsByTagName( "div" ); // do we know it's always in that div?
    foreach ( $divs as $div ) {
        // also loop all child nodes to get the text nodes
        foreach ( $div->childNodes as $child ) {
            // is this a text node?
            if ( $child->nodeType == XML_TEXT_NODE ) {
                // is there something in it (new lines count as text nodes)
                if ( trim( $child->nodeValue ) != "" ) {
                    // *pfew* got it
                    return $child->nodeValue;
                }
            }
        }
    }
    // failed? return false
    return false;
}
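The same two lookups can also be written with DOMXPath instead of manual loops. This is a different technique from the functions above, shown only as a sketch; the HTML string is a shortened, made-up version of the sample from the question.

```php
<?php
// Illustrative fragment, shortened from the question's sample.
$html = '<div><a href="http://image.com"><span><img src="http://image.com" /></span></a>'
      . ' Lorem Ipsum is simply dummy text </div>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// string() returns the value of the first matching href, or '' if there is none
$href = $xpath->evaluate('string(//a/@href)');

// text nodes that are direct children of a div and are not just whitespace
$textNodes = $xpath->query('//div/text()[normalize-space()]');
$text = $textNodes->length > 0 ? trim($textNodes->item(0)->nodeValue) : false;

echo $href, "\n";
echo $text, "\n";
```

The `normalize-space()` predicate does the job of the `trim()` check in the loop version: it skips text nodes that hold nothing but line breaks or spaces.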

Thanks for your time. I used your script on both the example above and the actual XML file, but I don't get any results. Instead I get an error which says "Cannot redeclare getText()", on the last line. — pano, Jan 13 '13 at 11:39
@pano The error message says just what's wrong. PHP has a built-in function called getText, which I didn't know. — Nemo64, Jan 13 '13 at 13:53
"a tag can't contain another tag AND some text in XML"? `bar bob` is perfectly valid XML. — IMSoP, Jan 13 '13 at 17:43
@IMSoP You sure? As far as I know, many XML parsers fail if you parse HTML for that reason. Last time I checked, even an XMLHttpRequest could not parse HTML (if you access responseXML; responseText works just fine). Or am I missing something? — Nemo64, Jan 13 '13 at 18:39
@Nemo64 Try it yourself, in whatever XML parser you want, e.g. http://codepad.org/W7AAa1Vh The most common thing that makes (valid) HTML invalid XML is standalone tags, like `<br>`, since in XML **every** tag must be closed, either as `<br></br>` or the short-hand `<br />`. There are other differences - a fragment of HTML often won't constitute a "document" (no single overall containing element), some HTML attributes don't need a value (… — IMSoP, Jan 13 '13 at 18:48
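The points made in this exchange can be checked with a few lines of PHP. This is a minimal sketch with made-up element names (`foo`, `baz`, `p`), not the code behind the codepad link.

```php
<?php
libxml_use_internal_errors(true);   // collect parse errors instead of emitting warnings

// Mixed content: an element holding both text and a child element.
$mixed = simplexml_load_string('<foo>bar <baz>bob</baz></foo>');

// An HTML-style standalone tag that is never closed.
$unclosed = simplexml_load_string('<p>one<br>two</p>');

// The same fragment with the tag closed in the XML short-hand form.
$closed = simplexml_load_string('<p>one<br />two</p>');

var_dump($mixed !== false, $unclosed !== false, $closed !== false);
```

Mixed content parses, the unclosed tag makes the parser return false, and closing the tag fixes it.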

score 0 · Accepted Answer · answered Jan 14 '13 at 20:35

This is the final code that answers the question.

$xml = simplexml_load_file('myfile.xml');

// collect every <description> that sits inside an <item>
$descriptions = $xml->xpath('//item/description');

foreach ($descriptions as $description_node) {
    // parse the description's content as an HTML document of its own
    $description_dom = new DOMDocument();
    $description_dom->loadHTML((string)$description_node);

    // switch back to SimpleXML to query the parsed HTML
    $description_sxml = simplexml_import_dom($description_dom);

    $imgs = $description_sxml->xpath('//img');
    $text = $description_sxml->xpath('//div');

    foreach ($imgs as $image) {
        echo (string)$image['src'];
    }
    foreach ($text as $t) {
        echo (string)$t;
    }
}

This is IMSoP's code; I added the $text = $description_sxml->xpath('//div'); line to read the text that is inside the <div>.

In my case some of the posts in the XML have multiple <div> and <span> tags, so to parse all of them I might have to add another ->xpath for the <span>, or maybe an if... else statement so that if there isn't any content inside the <div>, the <span> contents are echoed instead. Thanks for your replies.
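The div-then-span fallback described above could be sketched like this. The helper name `extractText` and the sample fragments are hypothetical, and the fragments are loaded as well-formed XML here only to keep the example short.

```php
<?php
// Hypothetical helper: return the first non-empty text found directly
// inside a <div>, falling back to <span> when no div has text of its own.
function extractText(SimpleXMLElement $root): string
{
    foreach (['//div', '//span'] as $query) {
        foreach ($root->xpath($query) as $node) {
            $text = trim((string)$node);   // the element's own text only
            if ($text !== '') {
                return $text;
            }
        }
    }
    return '';
}

// Text sits directly in the div.
$a = simplexml_load_string('<body><div>Hi <span>there</span></div></body>');
// The div has no text of its own; the text is in the span.
$b = simplexml_load_string('<body><div><span>Hello</span></div></body>');

$fromDiv  = extractText($a);
$fromSpan = extractText($b);

echo $fromDiv, "\n", $fromSpan, "\n";
```

Trying the queries in a fixed order keeps the preference for `<div>` text explicit, and more fallbacks can be added to the list if the real feed needs them.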

For encoding problems when parsing XML this way, see also this [post](http://stackoverflow.com/questions/14336412/convert-parsed-text-with-php-to-utf-8) — pano, Jan 15 '13 at 22:42
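One possible cause of garbled Greek text in this setup is that DOMDocument::loadHTML() has traditionally assumed ISO-8859-1 when the markup declares no charset. A common workaround, which may or may not be what the linked post settles on, is to prepend a charset declaration to the fragment before loading it. The sketch below assumes the description string is already UTF-8; the Greek sample text is made up.

```php
<?php
// Illustrative UTF-8 fragment standing in for a description's content.
$fragment = '<div>Καλημέρα <span>κόσμε</span></div>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();

// Declare the charset up front so the HTML parser decodes the bytes as UTF-8.
$dom->loadHTML(
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $fragment
);
libxml_clear_errors();

$sxml = simplexml_import_dom($dom);
$divs = $sxml->xpath('//div');
$text = trim((string)$divs[0]);     // the div's own text, without the span

echo $text, "\n";
```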
