I have the following code:

<?php
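   // Collect libxml parse warnings instead of printing them.
   libxml_use_internal_errors(true);
   $dom = new DOMDocument();
   $dom->loadHTMLFile('http://www.next.co.uk/x532328s4');
   $xpath = new DOMXPath($dom);

   // Path without positional predicates: sometimes matches, but is not precise.
   $nodes_working = $xpath->query("//html/body/section/section/div/div/div/section/article/section/div/div/h1");
   // Path with positional predicates: returns no nodes.
   $nodes_not_working = $xpath->query("//html/body/section/section/div/div[2]/div/section[2]/article/section[2]/div[2]/div/h1");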

   echo '<pre>';
   print_r($nodes_working);
   echo '</pre>';
   foreach ($nodes_working as $i => $node) {
      echo "Node($i): ", $node->nodeValue, "\n";
   }

   echo '<pre>';
   print_r($nodes_not_working);
   echo '</pre>';
   foreach ($nodes_not_working as $i => $node) {
      echo "Node($i): ", $node->nodeValue, "\n";
   }
?>

Now, the problem is that the path given in $nodes_working is not really correct, although it sometimes grabs some data.

On the other hand, the second path, given in $nodes_not_working, is the correct one but doesn't grab anything. It contains numbers that specify which element is the right one when there is more than one 'parallel' (sibling) element, so it seems the parser doesn't know what to do when it encounters the numeric values.

My question is: how can I get the right data in PHP using XPath expressions in the format given in $nodes_not_working?

Filip
  • Please do not let us guess what you want to query, but also post expected output. – Jens Erat Feb 12 '14 at 20:59
  • In my example the output should be the title of the product, which is: Laser Cross Strap Sandals. It works in the first example, but that's just a coincidence. So basically, I want to find the value (or whatever is inside) stored at the given path. – Filip Feb 13 '14 at 13:15

1 Answer

Try to shorten the path by using predicates on identifier or class attributes. For example, in this case matching the parent element's class attribute seems reasonable.

//div[@class='Title']/h1

For matching HTML class attributes, also consider "How can I find an element by CSS class with XPath?".
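
As a minimal sketch of this approach, the snippet below runs both an exact class match and a token-aware variant against a small inline HTML fragment. The markup is a made-up stand-in for the product page (the real page structure is not reproduced here), and the second heading and the `featured` class are purely illustrative.

```php
<?php
// Hypothetical stand-in for the product page; the real markup may differ.
$html = <<<HTML
<html><body>
  <section>
    <div class="Title"><h1>Laser Cross Strap Sandals</h1></div>
    <div class="Title featured"><h1>Another Heading</h1></div>
  </section>
</body></html>
HTML;

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact match: the class attribute must be exactly "Title".
$exact = $xpath->query("//div[@class='Title']/h1");

// Token match: "Title" may be one of several space-separated classes.
$token = $xpath->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' Title ')]/h1"
);

foreach ($exact as $i => $node) {
    echo "Exact($i): ", trim($node->nodeValue), "\n";
}
foreach ($token as $i => $node) {
    echo "Token($i): ", trim($node->nodeValue), "\n";
}
```

The exact comparison skips elements that carry additional classes, while the token form treats `class` as a space-separated list, which is usually closer to how CSS selectors behave.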

Jens Erat
  • Well, this kind of path worked in my example (tested before), but it's not quite accurate because the website can contain more than one div with that class that includes an h1. – Filip Feb 13 '14 at 13:14
  • Then you might analyze the page for further structures to match against, e.g. the section with the ProductDetail class. As there are no identifiers around, there is no unique way to address the headline. Using the full path like you did might be more specific to the element, but is much more fragile in case of (even small) changes to the HTML. – Jens Erat Feb 13 '14 at 13:18
  • You are right about the pros and cons of such a solution, and I am aware of them. However, it doesn't tell me whether the built-in PHP parser can deal with the paths I have, which are pretty much correct :) – Filip Feb 13 '14 at 13:25
  • Well, syntactically the paths are totally fine; the second one just doesn't match what it should. If that path worked on your computer / in your browser, this could happen because: the site returns different HTML for different IP ranges, browsers, ...; the site has already changed the HTML; the site sometimes ships additional elements like advertisements; or some elements get added on the fly using JavaScript (which is not executed within PHP). Anyway: `<html>` is always the root element, so there is no need to use `//` at the beginning, just use `/`. – Jens Erat Feb 13 '14 at 14:03
  • So there should be no problem with reading paths like .../div[2]/...? If so, I will need to dig further into this particular example. – Filip Feb 13 '14 at 14:27
  • No, these are totally fine. The path just doesn't fit the HTML input. – Jens Erat Feb 13 '14 at 15:17
  • Unfortunately, the path seems to be the original path without any JS or other injections. When you follow this: //html/body/section/section/div/div[2]/div/section[2]/article/section[2]/div[2]/div/h1 in Firebug you will get: Laser Cross Strap Sandals. And it really doesn't look like any element on this path was modified. – Filip Feb 14 '14 at 11:01
  • [Firebug does not show the original HTML code](http://stackoverflow.com/questions/18241029/why-does-my-xpath-query-scraping-html-tables-only-work-in-firebug-but-not-the). Other HTML parsers can construct a different tree from the same input. In case of problems, always look at the _raw source code_, not at Firebug. The second path might fit the DOM representation in Firebug, but it does not fit the one built by other HTML parsers. – Jens Erat Feb 14 '14 at 11:34
  • Thanks, you were right! It seems that the parser works on some websites even with numbers in the path, so it looks like the source code (HTML) is the problem, not the parser itself. – Filip Feb 14 '14 at 12:11
  • The source code might be totally fine, too; it just gets parsed in different ways. For writing change-resistant and compatible XPath queries, use the fewest number of axis steps possible to uniquely identify an element, and try to avoid positional predicates whenever possible. – Jens Erat Feb 14 '14 at 12:20
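
To see why a positional path fails, it can help to ask the parser which path it actually built for the target element. The sketch below is illustrative only: the inline HTML is a hypothetical fragment, not the real page, and it uses `DOMNode::getNodePath()` to print the location of each `h1` as libxml sees it. It also shows that a positional step such as `div[2]` is evaluated normally, and that a valid path which does not fit the parsed tree returns an empty node list rather than raising an error.

```php
<?php
// Hypothetical fragment; substitute the raw HTML that PHP actually downloads.
$html = <<<HTML
<html><body>
  <section>
    <div><p>Banner</p></div>
    <div><h1>Laser Cross Strap Sandals</h1></div>
  </section>
</body></html>
HTML;

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// A positional predicate that fits the parsed tree matches as expected.
$fits = $xpath->query("/html/body/section/div[2]/h1");

// Syntactically valid, but there is no third div: the result is empty, not an error.
$misses = $xpath->query("/html/body/section/div[3]/h1");

echo "fits: ", $fits->length, ", misses: ", $misses->length, "\n";

// Ask libxml where each h1 really lives in the tree it built.
$paths = [];
foreach ($xpath->query("//h1") as $node) {
    $paths[] = $node->getNodePath();
}
echo implode("\n", $paths), "\n";
```

Comparing the printed path with the one copied from Firebug should show at which step the two trees diverge.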