Using Xpath with PHP to parse HTML

Question

I'm currently trying to parse some data from a forum. Here is the code:

$xml = simplexml_load_file('https://forums.eveonline.com');

$names = $xml->xpath("html/body/div/div/form/div/div/div/div/div[*]/div/div/table//tr/td[@class='topicViews']");
foreach($names as $name) 
{
    echo $name . "<br/>";
}

Anyway, the problem is that I'm using a Google Chrome XPath extension to help me get the path, and I'm guessing that Chrome is changing the HTML enough that the path matches nothing when my website runs this search. Is there some way I can make the host look at the site the way Google Chrome does, so that it gets the right code? What would you suggest?

Thanks!

Did you try disabling JavaScript in your web browser? Your PHP will not run it, so any change made by JavaScript on the website will not be there on the server. – Manuel Schweigert, Dec 05 '12 at 07:49
JS isn't being run on the page where I'm running this. I understand that XPath is for XML, but from what I've seen through Google searches, it's popular to use for HTML as well. – VixenSoul, Dec 05 '12 at 08:32

score 46 · Answer 1 · answered Dec 05 '12 at 08:06

My suggestion is to always use DOMDocument as opposed to SimpleXML, since it's a much nicer interface to work with and makes tasks a lot more intuitive.

The following example shows you how to load the HTML into a DOMDocument object and query the DOM using XPath. All you really need to do is find all td elements with a class name of topicViews; the loop then outputs the nodeValue of each node in the DOMNodeList returned by the XPath query.

/* Collect libxml parse errors internally instead of printing warnings -- turn on in production, off for debugging */
libxml_use_internal_errors(true);
/* Create a new DOMDocument object */
$dom = new DOMDocument;
/* Load the HTML */
$dom->loadHTMLFile("https://forums.eveonline.com");
/* Create a new DOMXPath object */
$xpath = new DOMXPath($dom);
/* Query all <td> nodes whose class attribute is exactly "topicViews" */
$nodes = $xpath->query("//td[@class='topicViews']");
/* Set HTTP response header to plain text for debugging output */
header("Content-type: text/plain");
/* Traverse the DOMNodeList object to output each DOMNode's nodeValue */
foreach ($nodes as $i => $node) {
    echo "Node($i): ", $node->nodeValue, "\n";
}
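
To try the same approach without hitting the live forum, here is a minimal sketch that runs the query against an inline HTML string. The sample markup and the helper name extractTopicViews are made up for illustration; only the td class topicViews comes from the question. The sketch also assumes a cell may carry more than one class, so it matches the class as a token with contains() rather than by exact equality.

```php
<?php
/**
 * Hypothetical helper: return the trimmed text of every <td> whose
 * class list contains the given class name.
 */
function extractTopicViews(string $html, string $class = 'topicViews'): array
{
    // Collect libxml parse warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Match the class as a whole token, so class="topicViews hot" also matches
    $query = "//td[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";

    $values = [];
    foreach ($xpath->query($query) as $node) {
        $values[] = trim($node->nodeValue);
    }
    return $values;
}

// Made-up sample markup, not the real forum HTML
$sample = <<<'HTML'
<html><body><table>
  <tr><td class="topicTitle">First topic</td><td class="topicViews">120</td></tr>
  <tr><td class="topicTitle">Second topic</td><td class="topicViews hot">4,531</td></tr>
</table></body></html>
HTML;

$views = extractTopicViews($sample);
print_r($views);
```

With the sample markup above, $views holds the two view counts, 120 and 4,531. The exact-match form @class='topicViews' from the answer would skip the second cell, because its class attribute contains an extra token.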

Damien Overeem · Answer 2 · 2018-11-21T11:07:26.247

score 3

A double '/' makes XPath search all descendants instead of only direct children. So the XPath '//table' returns every table in the document. You can also use it deeper in an expression, like 'html/body/div/div/form//table', to get all tables anywhere under 'html/body/div/div/form'.

This way you can make your code a bit more resilient against changes in the HTML source.
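
To make the difference concrete, here is a small sketch that compares a fully spelled-out path with a '//' path after the markup gains an extra wrapper element. It uses DOMDocument and DOMXPath on made-up HTML strings; the helper name countMatches and the sample markup are assumptions for illustration only.

```php
<?php
/** Hypothetical helper: count the nodes an XPath expression matches in an HTML string. */
function countMatches(string $html, string $expression): int
{
    // Collect libxml parse warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($expression);

    // query() returns false for a malformed expression
    return $nodes === false ? 0 : $nodes->length;
}

// Made-up markup: the same table before and after a wrapper <div> is added
$before = '<html><body><form><table><tr><td>1</td></tr></table></form></body></html>';
$after  = '<html><body><form><div><table><tr><td>1</td></tr></table></div></form></body></html>';

$strict  = '/html/body/form/table';   // every step must be a direct child
$relaxed = '/html/body/form//table';  // '//' matches a table at any depth under form

echo "strict, before:  ", countMatches($before, $strict), "\n";
echo "strict, after:   ", countMatches($after, $strict), "\n";
echo "relaxed, before: ", countMatches($before, $relaxed), "\n";
echo "relaxed, after:  ", countMatches($after, $relaxed), "\n";
```

The strict path should stop matching once the wrapper is present, while the '//' version still finds the table in both documents.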

I do suggest learning a little about xpath if you want to use it. Copy paste only gets you so far.

A simple explanation about the syntax can be found at w3schools.com/xml/xpath_syntax.asp

edited Nov 21 '18 at 11:07

answered Dec 05 '12 at 07:57


How can I get the entire HTML of each matching tag? I don't need it as an array. In my case I am using the XPath '//math' to select all math tags in the HTML, which I later have to replace with images. – Akshay Bajpei Oct 01 '20 at 14:27
