
I am doing some HTML scraping and have hit a wall with this one query. I am trying to return a set of values from the following HTML page structure:

<div id="product-grid">
    <ul>
        <li><div class="price">Cash Price: $20.00</div></li>
        <li><div class="price">Cash Price: $30.00</div></li>
        <li><div class="price">Cash Price: $40.00</div></li>
    </ul>
</div>

I am trying to get the "$20.00" prices returned in a list. If I use the following XPath:

id('product-grid')//p[@class="price"] 

I get a list of the full strings, such as "Cash Price: $40.00". If I try the following query:

substring-after(id('product-grid')//p[@class="price"] , "Price: ")

I get the correct output, but only get the first result. Anyone know how I can get all results?

I am running PHP 5.3.3 with libxml 2.7.8 for the XPath. I am running the XPath query as follows:

$xpath = new DOMXPath($html);
$resultset = $xpath->query($query);

I have been googling like mad trying to find out why this is happening! Please help!

Paweł Tomkiel
Michael

3 Answers


You have to apply substring-after() after selecting your list of nodes, for example in a predicate:

 id('product-grid')//div[@class="price"][substring-after(., 'Price: ')]

This should work.

EDIT: This seems to work. However, I can't test the return value because I don't know how to retrieve the substring'd value. What do you use?
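To illustrate the open point about retrieving the substring'd value: the predicate above only filters nodes, so the query still returns whole elements. A minimal sketch, assuming PHP's DOMXPath::evaluate() is called once per node with that node as the context. The sample markup is copied from the question, the variable names are made up for illustration, and the id is matched with @id here rather than id(), on the assumption that id() may not resolve when a fragment is loaded without a DTD:

```php
<?php
// Sample markup copied from the question.
$markup = '<div id="product-grid">
    <ul>
        <li><div class="price">Cash Price: $20.00</div></li>
        <li><div class="price">Cash Price: $30.00</div></li>
        <li><div class="price">Cash Price: $40.00</div></li>
    </ul>
</div>';

$doc = new DOMDocument();
$doc->loadXML($markup);
$xpath = new DOMXPath($doc);

// The predicate keeps only nodes whose substring-after() result is non-empty;
// query() still returns the div elements themselves, not the substrings.
$nodes = $xpath->query(
    '//div[@id="product-grid"]//div[@class="price"][substring-after(., "Price: ")]'
);

$prices = array();
foreach ($nodes as $node) {
    // evaluate() can return a string; $node is the context node for ".".
    $prices[] = $xpath->evaluate('substring-after(., "Price: ")', $node);
}

print_r($prices);
```

This keeps the string handling in XPath but still needs one evaluate() call per node, because an XPath 1.0 expression cannot return a list of strings.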

Tom
  • Using a function on the axis is an XPath 2.0 feature. Probably not available in a standard PHP environment. You should be able to apply it in a predicate filter: `id('product-grid')//p[@class="price"][substring-after(., 'Price: ')]`. Also, the sample XML shows `div` elements with `@class`, but the example XPath (and your answer) expect `p` to have `@class`. – Mads Hansen Sep 18 '11 at 11:45
  • @Mads Hansen, post edited to comply with 1.0. I used OP's code so I used p. Changed it to div indeed. – Tom Sep 18 '11 at 11:51

Sorry, but I don't think that this is possible in one step. As far as I know XPath 1.0 does not support function calls at the end of an XPath path. The answer here indicates the same.

Furthermore, you don't need id('product-grid') as the first path part, because the id is on the root element and does not need to be selected specially. If your sample XML is just a fragment of a larger XML document, the id() might be necessary though.

The following works as expected:

$xml = new DOMDocument();
$xml->loadXML('<div id="product-grid">
 <ul>
  <li><div class="price">Cash Price: $20.00</div></li>
  <li><div class="price">Cash Price: $30.00</div></li>
  <li><div class="price">Cash Price: $40.00</div></li>
</ul>
</div>');
$xpath = new DOMXPath($xml);
// Select every price element, then cut the text in PHP from the "$" sign onwards.
foreach ($xpath->query('//div[@class="price"]') as $n) {
    var_dump(substr($n->nodeValue, strpos($n->nodeValue, '$')));
}
Stefan Gehrig

The wanted processing cannot be specified as a single XPath 1.0 expression, because by definition any function that expects a single string argument but is given a node-set takes the string value of only the first node (in document order) of this node-set.

Also, unlike XPath 2.0, XPath 1.0 does not allow a function call as a location step.

Therefore, one solution is to issue this XPath expression:

substring-after((id('product-grid')//p[@class="price"])[$k], "Price: ") 

N times, substituting $k in each expression with 1,2,..., N, where N is the result of evaluating another XPath expression:

count(id('product-grid')//p[@class="price"])
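In PHP, the two expressions above could be driven roughly as follows. This is a sketch under assumptions, not a definitive implementation: it uses DOMXPath::evaluate() for both the count and the indexed substring-after() call, selects `div` rather than `p` to match the sample markup, and matches @id instead of calling id(), on the assumption that id() may not resolve for a fragment loaded without a DTD. The variable names are illustrative only.

```php
<?php
// Sample markup copied from the question.
$markup = '<div id="product-grid">
    <ul>
        <li><div class="price">Cash Price: $20.00</div></li>
        <li><div class="price">Cash Price: $30.00</div></li>
        <li><div class="price">Cash Price: $40.00</div></li>
    </ul>
</div>';

$doc = new DOMDocument();
$doc->loadXML($markup);
$xpath = new DOMXPath($doc);

// The node-set both expressions operate on.
$nodeSet = '//div[@id="product-grid"]//div[@class="price"]';

// N: evaluate() returns count() as a float, so cast it for the loop.
$n = (int) $xpath->evaluate(sprintf('count(%s)', $nodeSet));

$result = array();
for ($k = 1; $k <= $n; $k++) {
    // Substitute $k into the expression; XPath positions start at 1.
    $expr = sprintf('substring-after((%s)[%d], "Price: ")', $nodeSet, $k);
    $result[] = $xpath->evaluate($expr);
}

print_r($result);
```

Each iteration re-evaluates the full path, so for large documents it may be cheaper to query the nodes once and evaluate substring-after() relative to each node.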

Using XPath 2.0 one can do this with this simple and single expression:

id('product-grid')//p[@class="price"]/substring-after(., "Price: ")

which when evaluated produces exactly the wanted sequence of strings.

Dimitre Novatchev