Trying to retrieve text only from a div with xpath

Question

I'm trying to write a script that goes through a poorly coded webpage and returns the title element. However, whoever made the site I plan on scraping did not use ANY classes, just plain div elements. Here's the source of the page I'm trying to scrape:

<tbody>
<tr>
<td style = "...">
<div style = "...">
<div style = "...">TEXT I WANT</div>
</div>
</td>
</tr>
</tbody>

When I copy the XPath in Chrome, I get this string:

/html/body/table/tbody/tr[2]/td[3]/table/tbody/tr[1]/td/div/div[3]

I'm having trouble figuring out where to put that string in an XPath query. If an XPath query isn't the way to go, maybe I should use preg_match?

I tried this:

$location = '/html/body/table/tbody/tr[2]/td[3]/table/tbody/tr[1]/td/div/div[3]';
$html = file_get_contents($URL);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document:
foreach ($xpath->query($location) as $node) {
  echo $node, "\n";
}

but nothing is printed to the page.

Thanks.

EDIT: Full source code here: http://pastebin.com/K5tZ4dFH

EDIT2: Cleaner code screen shot: https://i.stack.imgur.com/Y9mDg.png

Still nothing being output. The page isn't very cleanly coded; maybe I need to clean up the DOM? Other code I found seemed to have something to clean it up, but the method errored out and I couldn't find any documentation on it. @Rikesh — Xander Luciano, Dec 17 '13 at 05:59
@hwnd it's above in the code, but the code is so messy I tried to shorten it down. Edited with a pastebin of the source — Xander Luciano, Dec 17 '13 at 06:31
possible duplicate of [Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?](http://stackoverflow.com/questions/18241029/why-does-my-xpath-query-scraping-html-tables-only-work-in-firebug-but-not-the) — Jens Erat, Dec 17 '13 at 08:53

hwnd · Accepted Answer · 2013-12-17T07:22:11.127

1

From looking at your source, try the following:

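// Fetch the raw page source
$html = file_get_contents($URL);

// Parse it into a DOM tree
$doc = new DOMDocument();
$doc->loadHTML($html);

// Select the div by part of its inline style instead of by an absolute path
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//div[contains(@style, 'left:20px')]");

foreach ($nodes as $node) {
   // textContent holds the node's text
   echo $node->textContent;
}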
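
For anyone who wants to try this without the original page, here is a minimal, self-contained sketch. The markup in `$html` is made up for illustration: only the `left:20px` style value comes from the query above, and the rest is an assumption about what the page roughly looks like. The call to `libxml_use_internal_errors(true)` is my addition, there only to keep parser warnings from messy markup off the page. The sketch also shows one likely reason the Chrome-copied path returns nothing: browsers add `tbody` elements to the DOM they display, so a path containing `/tbody` only matches if the raw HTML really contains those tags.

```php
<?php
// Hypothetical markup standing in for the real page; note there is no <tbody> in it.
$html = <<<HTML
<html><body>
<table>
  <tr>
    <td style="width:100px">
      <div style="position:relative">
        <div style="position:absolute; left:20px">TEXT I WANT</div>
      </div>
    </td>
  </tr>
</table>
</body></html>
HTML;

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// A browser-style path with tbody finds nothing here, because the parser
// does not insert a tbody that is missing from the raw source.
$withTbody = $xpath->query('/html/body/table/tbody/tr/td/div/div');

// Matching on the style attribute does not depend on the exact nesting.
$byStyle = $xpath->query("//div[contains(@style, 'left:20px')]");

$titles = [];
foreach ($byStyle as $node) {
    // Read the text via textContent; the node object itself cannot be echoed.
    $titles[] = trim($node->textContent);
}

echo $withTbody->length, "\n";
echo implode("\n", $titles), "\n";
```

If the real page uses a different style value for the title div, the string inside `contains()` would need to change accordingly.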

edited Dec 17 '13 at 07:22

answered Dec 17 '13 at 06:35

Yeah, I just reduced the source code down because it was messy. I added a pastebin of the source code if you can make sense of it. – Xander Luciano Dec 17 '13 at 06:36
http://i.imgur.com/lWKheBy.png Maybe this image will help, it shows the code a little more organized. I'm trying to get the title of something off a site. – Xander Luciano Dec 17 '13 at 06:41

Bohemian · Answer 2 · 2014-03-27T12:08:34.653

1

It looks like you want the text just before the first </div>, so this regex will find that:

[^<>]+(?=<\/div>)

Here's a live demo.
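
A minimal sketch of how that pattern could be applied from PHP with `preg_match`. The `$html` string is a made-up stand-in for the page source, and the `~` delimiters are just a choice so the slash in `</div>` needs no escaping. Keep in mind that the pattern matches text in front of any `</div>`, so `preg_match` returns the first such run on the page. That is only the title if the title happens to be the first div that directly contains text.

```php
<?php
// Hypothetical fragment standing in for the real page source.
$html = '<td style="..."><div style="..."><div style="...">TEXT I WANT</div></div></td>';

// One or more characters that are not angle brackets, immediately followed by </div>
$pattern = '~[^<>]+(?=</div>)~';

$title = null;
if (preg_match($pattern, $html, $matches) === 1) {
    // $matches[0] is the full match; the lookahead keeps </div> out of it
    $title = trim($matches[0]);
}

echo $title, "\n";
```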

edited Mar 27 '14 at 12:08

answered Dec 17 '13 at 14:06
