2

I want to retrieve the content of the element that immediately follows a given tag in a document.

For example, in the markup below I would like to retrieve only <blockquote>Content 1</blockquote>, once for each span.

<html>
<body>


<span id=12341></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>

<!-- misc html in between including other spans w/ no relative blockquotes-->

<span id=12342></span>
<blockquote>Content 1</blockquote>

<!-- misc html in between including other spans w/ no relative blockquotes-->

<span id=12343></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<blockquote>Content 3</blockquote>
<blockquote>Content 4</blockquote>

<!-- misc html in between including other spans w/ no relative blockquotes-->    

<span id=12344></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<blockquote>Content 3</blockquote>


</body>
</html>

Now two things I'm wondering:

1.) How can I write an expression that matches and outputs only a blockquote that immediately follows an empty, closed element (<span></span>)?

2.) If I ever need to output Content 2, Content 3, etc. in the future, how could I get them while still following the rule from the previous question?

Gordon
Tek
  • I don't know much about PHP, but isn't PHP on the server-side and DOM on the client-side? – Šime Vidas Nov 27 '10 at 14:05
  • Yes, but PHP can be used to retrieve / process documents and parse all different kinds of data. See http://php.net/manual/en/book.dom.php – Tek Nov 27 '10 at 14:31
  • If I could, I would love to use JavaScript, but you can't parse HTML from external addresses since it's bound by the Same Origin Policy... – Tek Nov 27 '10 at 14:40
  • Good question, +1. See my answer for explanation and complete solution. – Dimitre Novatchev Nov 27 '10 at 19:18

4 Answers

3

Now two things I'm wondering:

1.) How can I write an expression that matches and only outputs a blockquote that's followed right after a closed element (<span></span>)?

Assuming that the provided text is converted to a well-formed XML document (the values of the id attributes must be enclosed in quotes), use:

/*/*/span/following-sibling::*[1][self::blockquote]

This means in English: Select all blockquote elements each of which is the first, immediate following sibling of a span element that is a grand-child of the top element of the document.
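As a rough illustration, the expression can be run from PHP with DOMXPath. This is only a sketch: the `$xml` sample is a shortened, well-formed stand-in for the question's markup, and the variable name `$firstQuotes` is made up for the example.

```php
<?php
// Sample markup (an assumption): shortened from the question, with quoted ids.
$xml = <<<XML
<html>
<body>
<span id="12341"></span>
<blockquote>Content 1</blockquote>
<blockquote>Content 2</blockquote>
<p>misc</p>
<span id="12342"></span>
<p>no blockquote directly after this span</p>
<span id="12343"></span>
<blockquote>Content 1</blockquote>
</body>
</html>
XML;

$dom = new DOMDocument;
$dom->loadXML($xml);
$xp = new DOMXPath($dom);

// Take the first element sibling after each span, keep it only if it is a blockquote.
$firstQuotes = [];
foreach ($xp->query('/*/*/span/following-sibling::*[1][self::blockquote]') as $node) {
    $firstQuotes[] = $node->textContent;
}

print_r($firstQuotes);
```

In this sample the second span is followed by a p element, so it contributes nothing to the result.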

2.) If I wanted, how could I get Content 2, Content 3, etc if I ever have a need to output them in the future while still applying to the rules of the previous question?

Yes.

You can get all sets of contiguous blockquote elements following a span:

 /*/*/span/following-sibling::blockquote
          [preceding-sibling::*[not(self::blockquote)][1][self::span]]

You can get the contiguous set of blockquote elements following the (N+1)-st span by:

/*/*/span/following-sibling::blockquote
           [preceding-sibling::*
             [not(self::blockquote)][1]
                [self::span and count(preceding-sibling::span)=$vN]
           ]

where $vN should be substituted by the number N.
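In PHP, one way to do that substitution is to build the query string yourself, since as far as I know DOMXPath offers no way to bind XPath variables. The sketch below assumes that approach; `blockquotesAfterSpan()` is a hypothetical helper and the sample markup (with A1, A2, B1 as contents) is invented for the example.

```php
<?php
// Hypothetical helper: text of the contiguous blockquotes after the (N+1)-st span.
function blockquotesAfterSpan(DOMXPath $xp, int $n): array
{
    // %d forces an integer, so nothing else can leak into the expression.
    $query = sprintf(
        '/*/*/span/following-sibling::blockquote'
        . '[preceding-sibling::*[not(self::blockquote)][1]'
        . '[self::span and count(preceding-sibling::span)=%d]]',
        $n
    );
    $texts = [];
    foreach ($xp->query($query) as $node) {
        $texts[] = $node->textContent;
    }
    return $texts;
}

// Sample markup (an assumption), shortened from the question.
$dom = new DOMDocument;
$dom->loadXML(
    '<html><body>'
    . '<span id="12341"></span><blockquote>A1</blockquote><blockquote>A2</blockquote>'
    . '<p>misc</p>'
    . '<span id="12342"></span><blockquote>B1</blockquote>'
    . '</body></html>'
);
$xp = new DOMXPath($dom);

$first = blockquotesAfterSpan($xp, 0);  // blockquotes after the first span
$second = blockquotesAfterSpan($xp, 1); // blockquotes after the second span
print_r($first);
print_r($second);
```

Typing the parameter as int and formatting it with %d keeps arbitrary strings out of the XPath expression.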

Thus, the contiguous set of blockquote elements following the first span is selected by:

/*/*/span/following-sibling::blockquote
           [preceding-sibling::*
             [not(self::blockquote)][1]
                [self::span and count(preceding-sibling::span)=0]
           ]

The contiguous set of blockquote elements following the second span is selected by:

/*/*/span/following-sibling::blockquote
           [preceding-sibling::*
             [not(self::blockquote)][1]
                [self::span and count(preceding-sibling::span)=1]
           ]

etc. ...

The following expression selects the blockquote elements after the fourth span (the original answer showed them highlighted in the XPath Visualizer):

/*/*/span/following-sibling::blockquote
           [preceding-sibling::*
             [not(self::blockquote)][1]
                [self::span and count(preceding-sibling::span)=3]
           ]

Dimitre Novatchev
0

Short answer: Load your HTML into DOMDocument, and select the nodes you want with XPath.

http://www.php.net/DOM

Long answer:

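// Assumption: the markup is loaded from a file; 'yourFile.html' is a placeholder.
$dom = new DOMDocument;
$dom->loadHTMLFile('yourFile.html');
$body = $dom->getElementsByTagName('body')->item(0);

$flag = false;   // true while the previous element was a span
$TEXT = array();
foreach ($body->childNodes as $el) {
    // Skip text and comment nodes; only elements are of interest.
    if ($el->nodeType !== XML_ELEMENT_NODE) continue;
    if ($el->nodeName === 'span') {
        $flag = true;
        continue;
    }
    if ($flag && $el->nodeName === 'blockquote') {
        $TEXT[] = $el->textContent;
    }
    // Any element other than a span ends the "right after a span" window.
    $flag = false;
}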
timdream
  • That's the question I'm asking. I don't know how to write something that will output the content of `<blockquote>` ONLY if it's after a `<span>`. – Tek Nov 27 '10 at 15:09
  • Actually it's not a total solution, yet I think you will be able to figure out how to extract Content 2/3 and filter out other `span` from the same `foreach` pattern. – timdream Nov 27 '10 at 15:31
  • What do you mean by "Content 2/3"? And why did you write it without XPath? I can use that as a solution as well. – Tek Nov 27 '10 at 15:46
  • You asked "how could I get Content 2, Content 3, etc" in the question? – timdream Nov 27 '10 at 15:48
  • Oh, sorry. I misread that. But anyway, what's the way to do it with XPath? – Tek Nov 27 '10 at 15:51
  • XPath is like CSS selectors, where you should get the NodeList you want right away if you made the right query. Given the fact your HTML is quite flat, XPath would be a bit of overkill. – timdream Nov 27 '10 at 15:59
0

Try the following *

/html/body/span/following-sibling::*[1][self::blockquote]

to match any first blockquotes after a span element that are direct children of body or

//span/following-sibling::*[1][self::blockquote]

to match any first blockquotes following a span element anywhere in the document

* edit: fixed XPath. Credits to Dimitre. My initial version would match any first blockquote after the span, e.g. it would match span p blockquote, which is not what you wanted.

Both of the above match the "Content 1" blockquotes. If you want to match the other blockquotes following the span (siblings, not descendants), remove the [1]. Note that without the [1], every blockquote sibling after the span matches, including those that belong to a later span.

Example:

$dom = new DOMDocument;
$dom->load('yourFile.xml');
$xp = new DOMXPath($dom);
$query = '/html/body/span/following-sibling::*[1][self::blockquote]';
foreach($xp->query($query) as $blockquote) {
    echo $dom->saveXml($blockquote), PHP_EOL;
}

If you want to do that without XPath, you can do

$dom = new DOMDocument;
$dom->preserveWhiteSpace = FALSE;
$dom->load('yourFile.xml');
$body = $dom->getElementsByTagName('body')->item(0);
foreach($body->getElementsByTagName('span') as $span) {
    if($span->nextSibling !== NULL &&
       $span->nextSibling->nodeName === 'blockquote')
    {
        echo $dom->saveXml($span->nextSibling), PHP_EOL;
    }
}

If the HTML you scrape is not valid XHTML, use loadHtmlFile() instead to load the markup. You can suppress errors with libxml_use_internal_errors(TRUE) and libxml_clear_errors().
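A sketch of that loading step, assuming the markup is already in a string (so loadHTML() is used here rather than loadHtmlFile()); the `$html` sample with unquoted ids and the variable names are illustrative only.

```php
<?php
// Sample markup (an assumption): unquoted ids as in the question, so not valid XML.
$html = '<html><body>'
      . '<span id=12341></span><blockquote>Content 1</blockquote>'
      . '<blockquote>Content 2</blockquote>'
      . '</body></html>';

// Collect parser warnings internally instead of emitting them.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);

// Discard the collected warnings and restore the earlier setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xp = new DOMXPath($dom);
$found = [];
foreach ($xp->query('//span/following-sibling::*[1][self::blockquote]') as $node) {
    $found[] = $node->textContent;
}
print_r($found);
```

Restoring the previous libxml error setting afterwards keeps the suppression from affecting unrelated parsing elsewhere in the script.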

Also see Best methods to parse HTML for alternatives to DOM (though I find DOM a good choice).

Gordon
  • I'm not loading XML, but shouldn't the equivalent of saveXml() be saveHTML? Because when I try saveHTML I get a `Warning: DOMDocument::saveHTML() expects exactly 0 parameters, 1 given in...` I'm guessing you can't pass a variable to the function... – Tek Nov 28 '10 at 00:15
  • @Tek yes, it should, but it isn't. saveXml allows you to pass a node as an argument, which will then give you the outerXml of the node. [As of this writing this is not possible with saveHtml, but there is a bug for this in the bugtracker](http://bugs.php.net/bug.php?id=50973). Since XHTML and HTML blockquotes don't differ, you can safely use saveXML here. – Gordon Nov 28 '10 at 10:59
0

Besides @Dimitre's good answer, you could also use:

/html
   /body
      /blockquote[preceding-sibling::*[not(self::blockquote)][1]
                     /self::span[@id='12341']]
  • Thanks for this, it will be useful if I ever want to retrieve a specific span with an ID. Too bad it's not like that in this case. – Tek Nov 27 '10 at 22:55
  • @Tek: Check the result. This evaluates to those `blockquote` elements whose first preceding non-`blockquote` sibling is a specific `span` with an `id` attribute equal to '12341'. If you want all the `blockquote` elements that immediately follow a `span`, just remove the last nested predicate. –  Nov 27 '10 at 22:59