Text to parse

<div id="test">some<b>bold</b> or <i>italic</i> text</div>
<div id="test">and again<b> bold text</b><i>and italic text</i></div>

Result I'd like to have

1 : some bold or italic text
2 : and again bold text and italic text

What I tried

string(//div)
normalize-space(//div)

These give correctly formatted text, but return only one result.

id('test')//text()

This returns all the text, but split into separate text nodes.

I tried to use string-join and concat, but with no luck. I want to do this in PHP.

Reno31
  • For now there isn't; I want to see whether it's possible to do so, or whether I have to look for another way. – Reno31 Dec 12 '11 at 11:37
  • Well, yes, it's possible. – Gordon Dec 12 '11 at 12:20
  • If you read this and are trying the same thing with SimpleXML, you're doing it the wrong way; see the example by [Andrey KNUPP](http://stackoverflow.com/users/982500/andrey-knupp). – Reno31 Dec 12 '11 at 14:04

4 Answers

0

There are not many styling tags in HTML, so you could just write your own function to strip the unwanted ones. Something like:

function htmlToText(text) {
    // Remove every opening and closing <i>, <b>, <s> and <span> tag, with or without attributes
    return text.replace(/<\/?(?:i|b|s|span)\b[^>]*>/gi, '');
}
Diogo Melo
0

You're going to need to use regular expressions here to extract the text from inside the HTML tags. If you're not hot on regex, this site will burn you up.

http://www.regular-expressions.info/

You then use preg_replace (http://php.net/preg_replace) to extract the text using the pattern that you constructed.
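
As a rough sketch of what this answer describes, and assuming the markup is as simple and flat as the sample in the question, preg_replace can swap every tag for a space and then collapse the extra whitespace. The function name stripTagsWithRegex is made up for this illustration. Regex is fragile on real-world HTML, which is what the comments on this answer point out, so treat this as a demonstration and not as a recommendation.

```php
<?php
// Hypothetical helper: replaces anything that looks like a tag with a space.
// Only suitable for simple fragments like the sample in the question.
function stripTagsWithRegex(string $html): string
{
    // Match <name ...> or </name ...>; the 'i' flag ignores case
    $text = preg_replace('/<\/?[a-z][^>]*>/i', ' ', $html);
    // Collapse runs of whitespace into one space and trim both ends
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo stripTagsWithRegex('<div id="test">some<b>bold</b> or <i>italic</i> text</div>'), PHP_EOL;
// prints "some bold or italic text"
```

Because each tag becomes a space, a word that is only partly wrapped in a tag (for example `un<b>bold</b>`) would be split in two.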

Mina
  • You might want to check out http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662 to get a second opinion on that regex part. – Gordon Dec 12 '11 at 12:20
  • @Gordon: Yeah. After reading 3 or 4 topics on SO, I now realize the folly of my answer. :) – Mina Dec 12 '11 at 12:59
0

Try this:

$dom = new \DOMDocument();
$dom->loadHTML('<!DOCTYPE HTML>
<html lang="en-US">
<head>
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
    <div id="test1">some<b>bold</b> or <i>italic</i> text</div>
    <div id="test2">and again<b> bold text</b><i>and italic text</i></div>
</body>
</html>');

$xpath = new \DOMXPath($dom);
// Select every div whose id contains "test" and print its full text content
foreach ($xpath->query('//div[contains(@id,"test")]') as $node) {
    echo $node->nodeValue, PHP_EOL;
}

Outputs:

somebold or italic text
and again bold textand italic text
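
The loop above can also apply an XPath function to each matched node: DOMXPath::evaluate accepts a context node as its second argument, so normalize-space(.) runs once per div instead of once for the whole document. This is a minimal sketch; the function name divTexts and the shortened HTML fragment are assumptions made for the example.

```php
<?php
// Hypothetical helper: returns the whitespace-normalized text of each matching div.
function divTexts(string $html): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $texts = [];
    foreach ($xpath->query('//div[contains(@id, "test")]') as $node) {
        // "." refers to the context node, here the current div
        $texts[] = $xpath->evaluate('normalize-space(.)', $node);
    }
    return $texts;
}

$html = '<div id="test1">some<b>bold</b> or <i>italic</i> text</div>'
      . '<div id="test2">and again<b> bold text</b><i>and italic text</i></div>';
print_r(divTexts($html));
```

normalize-space trims and collapses existing whitespace but does not insert a space between adjacent elements, so "some" and "bold" still run together.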
0

Suppose you have this XML document:

<html>
  <div id="test">some<b>bold</b> or <i>italic</i> text</div>
  <div id="test">and again<b> bold text</b><i>and italic text</i></div>
</html>

Then just use:

string(/*/div[1])

The result of evaluating this XPath expression is:

somebold or italic text

Similarly:

string(/*/div[2])

when evaluated produces:

and again bold textand italic text
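
In PHP these expressions can be run with DOMXPath::evaluate, and count(/*/div) gives the number of div elements so the index does not have to be hard-coded. This is a sketch under the assumption that the input is the XML document shown above; the function name stringValues is invented for the example.

```php
<?php
// Hypothetical helper: evaluates string(/*/div[i]) for every div under the root.
function stringValues(string $xml): array
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);
    $xpath = new DOMXPath($dom);

    // XPath 1.0 numbers are floating point, so cast the count for the loop
    $n = (int) $xpath->evaluate('count(/*/div)');

    $values = [];
    for ($i = 1; $i <= $n; $i++) {   // XPath positions start at 1
        $values[] = $xpath->evaluate("string(/*/div[$i])");
    }
    return $values;
}

$xml = '<html>'
     . '<div id="test">some<b>bold</b> or <i>italic</i> text</div>'
     . '<div id="test">and again<b> bold text</b><i>and italic text</i></div>'
     . '</html>';
print_r(stringValues($xml));
```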

If you want to separate the text nodes with a space, this cannot be achieved with a single XPath 1.0 expression (it can be done with a single XPath 2.0 expression such as string-join(/*/div[1]//text(), ' ')). Instead, you will need to evaluate:

 /*/div[1]//text()

This selects (in a list or array structure, depending on your programming language) all text node descendants of /*/div[1]:

"some" "bold" " or " "italic" " text".

Similarly:

 /*/div[2]//text()

selects (in a list or array structure, depending on your programming language) all text node descendants of /*/div[2]:

"and again" " bold text" "and italic text".

Now, in your programming language, concatenate these with a space between them to produce the final wanted result.
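
The concatenation step can be sketched in PHP, which the question targets, as follows. The sketch assumes the XML document from this answer; the function name spacedDivTexts is hypothetical. Since some text nodes already start or end with a space, joining with a space produces doubled spaces, which the sketch collapses afterwards.

```php
<?php
// Hypothetical helper: joins the text-node descendants of each div with a space.
function spacedDivTexts(string $xml): array
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);
    $xpath = new DOMXPath($dom);

    $result = [];
    foreach ($xpath->query('/*/div') as $div) {
        $parts = [];
        // ".//text()" selects the text-node descendants of this div only
        foreach ($xpath->query('.//text()', $div) as $textNode) {
            $parts[] = $textNode->nodeValue;
        }
        // Join with a space, then collapse the doubled spaces this creates
        $result[] = trim(preg_replace('/\s+/', ' ', implode(' ', $parts)));
    }
    return $result;
}

$xml = '<html>'
     . '<div id="test">some<b>bold</b> or <i>italic</i> text</div>'
     . '<div id="test">and again<b> bold text</b><i>and italic text</i></div>'
     . '</html>';

foreach (spacedDivTexts($xml) as $i => $text) {
    echo ($i + 1), ' : ', $text, PHP_EOL;
}
// prints:
// 1 : some bold or italic text
// 2 : and again bold text and italic text
```

The loop handles any number of div elements, so nothing has to be counted by hand.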

Dimitre Novatchev
  • Yes, OK, I understand this, but the actual point is to get the text of the child nodes within the XPath query, not just for one case. – Reno31 Dec 12 '11 at 13:29
  • @Reno31: I totally don't understand your comment -- what exactly are you saying? I have shown how to get the sequence of all text-node descendants and I have shown how to get them concatenated -- all with a single XPath expression. I have also described how to concatenate the individual text-node descendants while inserting intermediate space. What else do you want to do? – Dimitre Novatchev Dec 12 '11 at 13:37
  • Nothing, your response is great, but it is too specific: I don't know the exact number of possible results, and I don't want to spend my time counting how many nodes there are. My question was mostly due to using SimpleXML, which is great but not perfect; DOM solved my problem. – Reno31 Dec 12 '11 at 14:00