Selecting all child nodes' text values

Question

Text to parse

<div id="test">some<b>bold</b> or <i>italic</i> text</div>
<div id="test">and again<b> bold text</b><i>and italic text</i></div>

Result I'd like to have

1 : some bold or italic text
2 : and again bold text and italic text

What I tried

string(//div)
normalize-space(//div)

These give correctly formatted text, but return only one result.

id('test')//text()

This gives all the text, but splits the result into separate text nodes.

I tried to use string-join and concat, but with no luck. I want to do this in PHP.

For now there isn't; I want to see whether it's possible to do so, or whether I have to look for another way. – Reno31 Dec 12 '11 at 11:37
If you read this and try the same thing with SimpleXML, you're doing it the wrong way; see [Andrey KNUPP](http://stackoverflow.com/users/982500/andrey-knupp)'s example. – Reno31 Dec 12 '11 at 14:04

Diogo Melo · Answer 1 · 2011-12-12T12:18:39.130

0

HTML does not have many style tags, so you could just write your own function to strip the unwanted markup. Something like:

function htmlToText(text) {
    // Remove every opening and closing <i>, <b>, <s> and <span> tag, with or without attributes.
    return text.replace(/<\/?(?:i|b|s|span)\b[^>]*>/gi, '');
}
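
Since the question asks for PHP, the same tag-stripping idea can be sketched with PHP's built-in strip_tags() rather than a hand-written replace. This is only an illustration and not part of the original answer; the variable names are made up, and it works on the raw markup strings without parsing them. Like the function above, it removes the tags without inserting any space where two pieces of text meet.

```php
<?php
// The two fragments from the question, as raw strings (closing </i> assumed).
$fragments = [
    '<div id="test">some<b>bold</b> or <i>italic</i> text</div>',
    '<div id="test">and again<b> bold text</b><i>and italic text</i></div>',
];

// strip_tags() removes all HTML tags and keeps the text between them.
$plain = array_map('strip_tags', $fragments);

foreach ($plain as $line) {
    echo $line, PHP_EOL;
}
```

The words on either side of a removed tag end up glued together ("somebold"), so this alone does not produce the spacing asked for in the question.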

edited Dec 12 '11 at 12:18

answered Dec 12 '11 at 11:39

Diogo Melo


Can you be more specific and give an example of how to use this function? – Reno31 Dec 12 '11 at 12:13
Hi, I just updated the answer ;) – Diogo Melo Dec 12 '11 at 12:18

score 0 · Answer 2 · answered Dec 12 '11 at 11:45

You're going to need to use regular expressions here to extract the text from inside the HTML tags. If you're not hot on regex, this site will burn you up.

http://www.regular-expressions.info/

You then use preg_replace (http://php.net/preg_replace) to extract the text using the pattern that you constructed.

answered Dec 12 '11 at 11:45

Mina


you might want to check out http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662 to get a second opinion on that regex part – Gordon Dec 12 '11 at 12:20
@Gordon: yeah. after reading 3 or 4 topics on so, I realize now the folly of my answer. :) – Mina Dec 12 '11 at 12:59

score 0 · Accepted Answer · answered Dec 12 '11 at 12:27

0

Try this:

$dom = new \DOMDocument();
$dom->loadHTML('<!DOCTYPE HTML>
<html lang="en-US">
<head>
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
    <div id="test1">some<b>bold</b> or <i>italic</i> text</div>
    <div id="test2">and again<b> bold text</b><i>and italic text</i></div>
</body>
</html>');

$xpath = new \DOMXPath($dom);
foreach ($xpath->query('//div[contains(@id,"test")]') as $node) {
    echo $node->nodeValue, PHP_EOL;
}

Outputs:

somebold or italic text
and again bold textand italic text
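
If whitespace also needs collapsing, one option is to evaluate normalize-space(.) once per matched div, passing the div as the context node. That avoids the single-result behaviour of normalize-space(//div) described in the question, because each call sees exactly one node. A minimal sketch, assuming the same two divs as above; the variable names and the starts-with() predicate are illustrative choices, not part of the original answer:

```php
<?php
// Same two divs as in the answer above, loaded as an HTML fragment.
$html = '<div id="test1">some<b>bold</b> or <i>italic</i> text</div>'
      . '<div id="test2">and again<b> bold text</b><i>and italic text</i></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$lines = [];
foreach ($xpath->query('//div[starts-with(@id, "test")]') as $div) {
    // The second argument makes $div the context node, so "." refers to it.
    $lines[] = $xpath->evaluate('normalize-space(.)', $div);
}

print_r($lines);
```

normalize-space() only trims and collapses whitespace that is already there. It does not insert a space where two text nodes touch, so "somebold" stays as it is.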

answered Dec 12 '11 at 12:27

Does this work with SimpleXML? – Reno31 Dec 12 '11 at 12:33
I think so. How do you get the nodeValue with XML? – Dec 12 '11 at 12:35
Why \DOMXPath and not just DOMXPath? – Reno31 Dec 12 '11 at 13:30
Oops, because I wrote it in a file that was using a namespace. – Dec 12 '11 at 13:35
OK, so it works with DOM, but the result is different with SimpleXML, which gives an array with all the child nodes in it. – Reno31 Dec 12 '11 at 13:49
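
For readers who already hold SimpleXML objects, one way to get the DOM behaviour discussed in these comments is to bridge each element into DOM with dom_import_simplexml() and read its textContent. This is a sketch under assumptions: the input is well-formed XML wrapped in an html root element, and the variable names are invented for the example.

```php
<?php
// Well-formed XML version of the markup from the question.
$xml = '<html>'
     . '<div id="test1">some<b>bold</b> or <i>italic</i> text</div>'
     . '<div id="test2">and again<b> bold text</b><i>and italic text</i></div>'
     . '</html>';

$root = simplexml_load_string($xml);

$texts = [];
foreach ($root->xpath('//div') as $div) {
    // A plain (string) cast would skip the text inside <b> and <i>,
    // so hand the element to DOM and read the full text content instead.
    $texts[] = dom_import_simplexml($div)->textContent;
}

print_r($texts);
```

The result is the same concatenated text that nodeValue gives in the DOM example, without any added spaces.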

score 0 · Answer 4 · answered Dec 12 '11 at 13:15

Suppose you have this XML document:

<html>
  <div id="test">some<b>bold</b> or <i>italic</i> text</div>
  <div id="test">and again<b> bold text</b><i>and italic text</i></div>
</html>

Then just use:

string(/*/div[1])

The result of evaluating this XPath expression is:

somebold or italic text

Similarly:

string(/*/div[2])

when evaluated produces:

and again bold textand italic text

If you want to separate the text nodes with a space, this cannot be achieved with a single XPath 1.0 expression (it can be done with a single XPath 2.0 expression). Instead, you will need to evaluate:

 /*/div[1]//text()

This selects (in a list or array structure, depending on your programming language) all text node descendants of /*/div[1]:

"some" "bold" " or " "italic" " text".

Similarly:

 /*/div[2]//text()

selects (in a list or array structure, depending on your programming language) all text node descendants of /*/div[2]:

"and again" " bold text" "and italic text".

Now, using your programming language, you have to concatenate these with a space between them to produce the final wanted result.
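
In PHP, that last step can be sketched as below. It loops over every div, so the number of divs does not need to be known in advance. The trimming of each text node and the skipping of empty ones are assumptions added here so that existing leading or trailing spaces do not produce double spaces; the variable names are illustrative.

```php
<?php
// The XML document from this answer, written on one line.
$xml = '<html>'
     . '<div id="test">some<b>bold</b> or <i>italic</i> text</div>'
     . '<div id="test">and again<b> bold text</b><i>and italic text</i></div>'
     . '</html>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$results = [];
foreach ($xpath->query('/*/div') as $div) {
    $parts = [];
    // ".//text()" is evaluated relative to the current div.
    foreach ($xpath->query('.//text()', $div) as $textNode) {
        $piece = trim($textNode->nodeValue);
        if ($piece !== '') {
            $parts[] = $piece;
        }
    }
    // Join the pieces with a single space between them.
    $results[] = implode(' ', $parts);
}

print_r($results);
```

One trade-off of this approach: a space is inserted at every text-node boundary, so markup such as <b>bold</b>, followed by a comma would come out as "bold ," and may need extra handling.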

Yes, OK, I understand this, but the actual point is to get, in an XPath query, the text of the child nodes in general, not just for one case. – Reno31 Dec 12 '11 at 13:29
@Reno31: I totally don't understand your comment -- what exactly are you saying? I have shown how to get the sequence of all text-node descendants and I have shown how to get them concatenated -- all with a single XPath expression. I have also described how to concatenate the individual text-node descendants while inserting intermediate space. What else do you want to do? – Dimitre Novatchev Dec 12 '11 at 13:37
Nothing, your response is great, but it is too specific: I don't know exactly how many results there will be, and I don't want to spend my time counting how many nodes there are. My question is mostly due to my use of SimpleXML, which is great but not perfect, and DOM solves my problem. – Reno31 Dec 12 '11 at 14:00
