5

How can I select the string contents of the following nodes:

<span class="url">
 word
 <b class=" ">test</b>
</span>

<span class="url">
 word
 <b class=" ">test2</b>
 more words
</span>

I have tried a few things

//span/text()

Doesn't get the bold tag

//span/string(.)

is invalid

string(//span)

only selects 1 node

I am using SimpleXML in PHP, and the only other option I can think of is to use //span, which returns:

Array
(
    [0] => SimpleXMLElement Object
        (
            [@attributes] => Array
                (
                    [class] => url
                )

            [b] => test
        )

    [1] => SimpleXMLElement Object
        (
            [@attributes] => Array
                (
                    [class] => url
                )

            [b] => test2
        )

)

Note that it is also dropping the "more words" text from the second span.

So I guess I could then flatten each item in the array using PHP somehow? XPath is preferred, but any other ideas would help too.

bakkal
spyderman4g63
  • I also tried `//span//text()`, but that splits the text into separate elements in SimpleXML – spyderman4g63 Aug 04 '10 at 19:42
  • 2
    Do you need it with or without the actual b tags? (the content you do need I gather, but what about the tag strings). And how dedicated are you to `SimpleXML` as opposed to `DOM`? – Wrikken Aug 04 '10 at 19:44
  • I would rather not have the b tags, but if they are returned they are simple enough to remove. The main goal is to return 1 string for each span. I guess I don't fully understand the difference between a SimpleXML object and a DOM object. I create a DOM object, load the HTML into it and then import the DOM object into SimpleXML. Then I can execute XPath against the object. The return value is an array of SimpleXML objects (I think). This is what I do: `$html = new DOMDocument(); @$html->loadHTMLFile($url); $xml = simplexml_import_dom($html); /* find all the links */ $result = $xml->xpath("//span");` – spyderman4g63 Aug 04 '10 at 19:55
  • Added a simple DOM example as an answer. – Wrikken Aug 04 '10 at 20:04

7 Answers

5
$xml = '<foo>
<span class="url">
 word
 <b class=" ">test</b>
</span>

<span class="url">
 word
 <b class=" ">test2</b>
 more words
</span>
</foo>';
$dom = new DOMDocument();
$dom->loadXML($xml); //or load an HTML document with loadHTML()
$x= new DOMXpath($dom);
foreach($x->query("//span[@class='url']") as $node) echo $node->textContent;
Wrikken
3

You don't even need XPath for this:

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('span') as $span) {
    if(in_array('url', explode(' ', $span->getAttribute('class')))) {
        $span->nodeValue = $span->textContent;
    }
}
echo $dom->saveHTML();

EDIT after comment below

If you just want to fetch the string, you can do echo $span->textContent; instead of replacing the nodeValue. I understood you wanted to have one string for the span, instead of the nested structure. In this case, you should also consider whether simply running strip_tags on the span snippet wouldn't be the faster and easier alternative.
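The strip_tags idea mentioned above could look roughly like the following. This is only a sketch under a few assumptions: the sample markup is the question's, wrapped in a hypothetical `<div>`, and `DOMDocument::saveHTML()` is called with a node argument, which needs PHP 5.3.6 or later. The whitespace collapsing is an addition for readability, not part of the original suggestion.

```php
<?php
// Sample markup from the question, wrapped in a hypothetical <div> root.
$html = '<div><span class="url"> word <b class=" ">test</b></span>'
      . '<span class="url"> word <b class=" ">test2</b> more words</span></div>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$strings = [];
foreach ($dom->getElementsByTagName('span') as $span) {
    // saveHTML($node) serializes only this element, tags included.
    $snippet = $dom->saveHTML($span);
    // strip_tags() drops the markup; the regex collapses runs of whitespace.
    $strings[] = trim(preg_replace('/\s+/', ' ', strip_tags($snippet)));
}

print_r($strings);
```

One caveat: strip_tags leaves entities such as `&amp;` encoded, whereas textContent returns decoded text, so the two approaches are not interchangeable for every input.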


With PHP 5.3 you can also register arbitrary PHP functions for use as callbacks in XPath queries. The following would fetch the content of all span elements and their child nodes and return it as a single string.

$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPHPFunctions();
echo $xp->evaluate('php:function("nodeTextJoin", //span)');

// Custom Callback function
function nodeTextJoin($nodes)
{
    $text = '';
    foreach($nodes as $node) {
        $text .= $node->textContent;
    }
    return $text;
}
Gordon
  • I'm not sure that's what the OP is asking for. What this does is print out the whole document with all markup under the span tags removed, i.e. the first span element is now `<span class="url">word test</span>` instead of `<span class="url">word <b class=" ">test</b></span>` – Alex Jasmin Aug 04 '10 at 20:58
  • @Alexandra the OP's comment below the question reads *The main goal is to return 1 string for each span.* I interpreted this as replacing the original string, but now that you say it, yes, that might be wrong. – Gordon Aug 04 '10 at 21:04
  • Yeah, my main goal was to convert the contents of the span to a string. simple xml was taking the tags and converting them to an array. – spyderman4g63 Aug 05 '10 at 00:11
  • Hmm, never really _needed_ the `registerPHPFunctions`, but it would have saved quite some time in the past. Noted! – Wrikken Aug 05 '10 at 09:46
  • @Wrikken I've yet to find a real need for them either. The main downside is having to write `php:function("functioname", ...` and `php:functionString("functioname", ...` - that's just so cumbersome. And your XPath queries will no longer be portable to other languages then. But since it's possible and it's not a well-known feature, I thought I'd add them here. @salathe made a blog entry about this at http://cowburn.info/2009/10/23/php-funcs-xpath/ – Gordon Aug 05 '10 at 10:12
  • Portability would be an issue indeed, but for a quick & dirty "i just need to do this particular thing once and then we're done" it really saves time. Or to mimic some XPath 2.0 functionality back in good ol' XPath 1.0 when working with data expecting 2.0. – Wrikken Aug 05 '10 at 10:33
  • @Wrikken yup, actually I remembered the possibility to use PHP functions when Alejandro mentioned XPath2 in his answer. – Gordon Aug 05 '10 at 10:50
2

Using XMLReader:

$xmlr = new XMLReader;
$xmlr->xml($doc);
while ($xmlr->read()) {
    if (($xmlr->nodeType == XMLReader::ELEMENT) && ($xmlr->name == 'span')) {
        echo $xmlr->readString();
    }
}

Output:

word
test

word
test2
more words
GZipp
1

SimpleXML doesn't like mixing text nodes with other elements; that's why you're losing some content there. The DOM extension, however, handles that just fine. Luckily, DOM and SimpleXML are two sides of the same coin (libxml), so it's very easy to switch between them. For instance:

foreach ($yourSimpleXMLElement->xpath('//span') as $span)
{
    // will not work as expected
    echo $span;

    // will work as expected
    echo textContent($span);
}

function textContent(SimpleXMLElement $node)
{
    return dom_import_simplexml($node)->textContent;
}
Josh Davis
  • 1
    Interesting. But it's simpler to just use the DOM for everything, as in @Wrikken's answer – Alex Jasmin Aug 04 '10 at 20:54
  • DOM is an order of magnitude more complicated to use than SimpleXML but yeah, whatever works for you. – Josh Davis Aug 04 '10 at 23:15
  • Sorry. I don't mean we should use the DOM all the time. DOM code can get horribly verbose. But in the context of this simple task I don't see the point of mixing the two APIs. In fact you save a few keystrokes by not calling dom_import_simplexml() in this case – Alex Jasmin Aug 05 '10 at 01:49
0
//span//text()

This may be the best you can do. You'll get multiple text nodes because the text is stored in separate nodes in the DOM. If you want a single string you'll have to just concatenate the text nodes yourself since I can't think of a way to get the built-in XPath functions to do it.

Using string() or concat() won't work because these functions expect string arguments. When you pass a node-set to a function expecting a string, the node-set is converted to a string by taking the text content of the first node in the node-set. The rest of the nodes are discarded.
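To make the first-node behaviour described above concrete, here is a small sketch using PHP's DOMXPath. It assumes the question's markup wrapped in a hypothetical `<foo>` root (as in Wrikken's answer) with whitespace trimmed for brevity; the variable names are illustrative only.

```php
<?php
// Question's markup inside a hypothetical <foo> root element.
$xml = '<foo><span class="url"> word <b class=" ">test</b></span>'
     . '<span class="url"> word <b class=" ">test2</b> more words</span></foo>';

$dom = new DOMDocument;
$dom->loadXML($xml);
$xp = new DOMXPath($dom);

// string() applied to a node-set only uses the first node in document order.
$first = $xp->evaluate('string(//span)');

// Concatenating manually: query the text nodes relative to each span.
$joined = [];
foreach ($xp->query('//span') as $span) {
    $text = '';
    foreach ($xp->query('.//text()', $span) as $textNode) {
        $text .= $textNode->nodeValue;
    }
    $joined[] = $text;
}

var_dump($first);
print_r($joined);
```

Here `$first` holds only the text of the first span, while `$joined` ends up with one concatenated string per span.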

John Kugelman
0

How can I select the string contents of the following nodes:

First, I think your question is not clear.

You could select the descendant text nodes, as John Kugelman answered, with

//span//text()

I recommend using an absolute path (one not starting with //).

But with this you would need to process the text nodes and work out which parent span each one belongs to. So it would be better to just select the span elements (for example, //span) and then process each one's string value.
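In PHP, selecting the span elements and then taking each one's string value might be sketched as follows. The `<foo>` root and the absolute path `/foo/span` are assumptions made for this example; `normalize-space(.)` is evaluated with each span as the context node, which also trims and collapses whitespace.

```php
<?php
// Question's markup inside a hypothetical <foo> root element.
$xml = '<foo><span class="url"> word <b class=" ">test</b></span>'
     . '<span class="url"> word <b class=" ">test2</b> more words</span></foo>';

$dom = new DOMDocument;
$dom->loadXML($xml);
$xp = new DOMXPath($dom);

$values = [];
// Absolute path instead of //span, then one string value per element.
foreach ($xp->query('/foo/span') as $span) {
    // The second argument makes $span the context node for "." in the expression.
    $values[] = $xp->evaluate('normalize-space(.)', $span);
}

print_r($values);
```

This stays within XPath 1.0, so it should work with the libxml-based extensions without any extra setup.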

With XPath 2.0 you could use:

string-join(//span, '.')

Result:

word test. word test2 more words

With XSLT 1.0, this input:

<div>
<span class="url">
 word
 <b class=" ">test</b>
</span>

<span class="url">
 word
 <b class=" ">test2</b>
 more words
</span>
</div>

With this stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:template match="span[@class='url']">
        <xsl:value-of select="concat(substring('.',1,position()-1),normalize-space(.))"/>
    </xsl:template>
</xsl:stylesheet>

Output:

word test.word test2 more words
  • [DOM uses libxml](http://www.php.net/manual/en/dom.requirements.php) and [libxml does not support XPath 2.0](http://xmlsoft.org/index.html) – Gordon Aug 04 '10 at 21:12
  • @Gordon: "but any other ideas would help too." –  Aug 04 '10 at 21:15
  • @Alejandro just saying, in case anybody tries and wonders why it wont work – Gordon Aug 04 '10 at 21:27
  • @Gordon: And I added an XPath 2.0 solution because it would be good if more people knew its new features and updated their platforms or asked vendors to do so. –  Aug 05 '10 at 12:59
0

Along the lines of Alejandro's XSLT 1.0 "but any other ideas would help too" answer...

XML:

<?xml version="1.0" encoding="UTF-8"?>
<div>
    <span class="url">
        word
        <b class=" ">test</b>
    </span>
    <span class="url">
        word
        <b class=" ">test2</b>
        more words
    </span>
</div>

XSL:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:template match="span">
        <xsl:value-of select="normalize-space(data(.))"/>
    </xsl:template>
</xsl:stylesheet>

OUTPUT:

word test
word test2 more words
Daniel Haley
  • Thanks. I'm pretty sure this would work if I were going to go with XSL, but the XPath example is better for the little thing I am doing. I'm also used to some custom extensions we use at work that are not in EXSLT. – spyderman4g63 Aug 05 '10 at 00:20
  • `fn:data()` is XPath 2.0, so I think you should say this solution is **XSLT 2.0** –  Aug 05 '10 at 12:27