I'm currently trying to parse some data from a forum. Here is the code:

$xml = simplexml_load_file('https://forums.eveonline.com');

$names = $xml->xpath("html/body/div/div/form/div/div/div/div/div[*]/div/div/table//tr/td[@class='topicViews']");
foreach($names as $name) 
{
    echo $name . "<br/>";
}

Anyway, the problem is that I'm using a Google Chrome XPath extension to help me get the path, and I'm guessing that Chrome is changing the HTML enough that the path doesn't match when my website runs this search. Is there some way I can make the host look at the site the way Google Chrome does, so that it gets the right markup? What would you suggest?

Thanks!

nikc.org
VixenSoul
  • Did you try disabling JavaScript in your web browser? Your PHP script will not run it, so any change made by JavaScript on the website will not be present in what the server receives. – Manuel Schweigert Dec 05 '12 at 07:49
  • XPath is for XML, not for HTML. – GolezTrol Dec 05 '12 at 07:53
  • JS isn't being run on the page I'm running this. I understand that XPath is for XML, but from what I've seen through Google searches, it's popular to use for HTML as well. – VixenSoul Dec 05 '12 at 08:32

2 Answers

My suggestion is to always use DOMDocument as opposed to SimpleXML, since it's a much nicer interface to work with and makes tasks a lot more intuitive.

The following example shows you how to load the HTML into a DOMDocument object and query the DOM using XPath. All you really need to do is find all td elements whose class attribute is topicViews; the example then outputs the nodeValue of each node in the DOMNodeList returned by that XPath query.

/* Use internal libxml errors -- turn on in production, off for debugging */
libxml_use_internal_errors(true);
/* Create a new DOMDocument object */
$dom = new DOMDocument;
/* Load the HTML */
$dom->loadHTMLFile("https://forums.eveonline.com");
/* Create a new XPath object */
$xpath = new DOMXPath($dom);
/* Query all <td> nodes containing specified class name */
$nodes = $xpath->query("//td[@class='topicViews']");
/* Set HTTP response header to plain text for debugging output */
header("Content-type: text/plain");
/* Traverse the DOMNodeList object to output each DomNode's nodeValue */
foreach ($nodes as $i => $node) {
    echo "Node($i): ", $node->nodeValue, "\n";
}
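
For anyone who wants to try the approach without hitting the live forum, here is a minimal offline sketch. It assumes a made-up HTML fragment (the $html string and its markup are hypothetical, not the real page) and loads it with loadHTML() instead of loadHTMLFile(). It also shows a token-based class test, which is one common way to match cells that carry more than one class:

```php
<?php
// Hypothetical markup standing in for the forum page; the real page differs.
$html = <<<MARKUP
<html><body>
<table>
  <tr><td class="topicTitle">First topic</td><td class="topicViews">120</td></tr>
  <tr><td class="topicTitle">Second topic</td><td class="topicViews hot">45</td></tr>
</table>
</body></html>
MARKUP;

/* Collect libxml parse warnings internally instead of printing them */
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

/* Exact attribute match: only cells whose class is exactly "topicViews" */
$exact = $xpath->query("//td[@class='topicViews']");

/* Token match: pads the class list with spaces so "topicViews hot" matches too */
$token = $xpath->query(
    "//td[contains(concat(' ', normalize-space(@class), ' '), ' topicViews ')]"
);

/* Gather the text of every matched cell */
$values = [];
foreach ($token as $node) {
    $values[] = trim($node->nodeValue);
}
echo implode("\n", $values), "\n";
```

With this fragment the exact match finds one cell and the token match finds both, so the stricter query is worth checking if results seem to be missing.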
Sherif

A double slash ('//') makes XPath search all descendants at any depth instead of only direct children. So the XPath '//table' returns every table in the document. You can also use it deeper in an expression: 'html/body/div/div/form//table' returns all tables anywhere under 'html/body/div/div/form'.

This way you can make your code a bit more resilient against changes in the html source.
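
To make the difference concrete, here is a small sketch using DOMDocument and DOMXPath on an invented fragment. The markup, the wrapper div nesting, and the id values are all hypothetical and only serve to show which tables each expression selects:

```php
<?php
// Hypothetical markup; the nesting and ids are invented for illustration.
$html = <<<MARKUP
<html><body>
<div><div><form>
  <div><table id="inside-form"><tr><td>a</td></tr></table></div>
</form></div></div>
<table id="outside-form"><tr><td>b</td></tr></table>
</body></html>
MARKUP;

/* Collect libxml parse warnings internally instead of printing them */
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

/* '//table' matches every table in the document, at any depth */
$all = $xpath->query('//table');

/* The fixed prefix pins down the form; '//table' then searches below it */
$inForm = $xpath->query('/html/body/div/div/form//table');

echo $all->length, "\n";
foreach ($inForm as $table) {
    echo $table->getAttribute('id'), "\n";
}
```

Here the first query selects both tables, while the second selects only the one nested inside the form, even though an extra div sits between the form and the table.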

I do suggest learning a little about xpath if you want to use it. Copy paste only gets you so far.

A simple explanation about the syntax can be found at w3schools.com/xml/xpath_syntax.asp

Damien Overeem
  • How can I get the entire matching HTML tags as markup, rather than as an array? In my case I am using the XPath '//math' to select all math tags in the HTML, which I later have to replace with images. – Akshay Bajpei Oct 01 '20 at 14:27