I'm trying to write a script that goes through a poorly coded webpage and returns the title element. However, the person who made the website I plan on scraping did not use ANY classes, just div elements. Here's the source of the webpage I'm trying to scrape:

<tbody>
<tr>
<td style = "...">
<div style = "...">
<div style = "...">TEXT I WANT</div>
</div>
</td>
</tr>
</tbody>

When I copy the XPath in Chrome, I get this string:

/html/body/table/tbody/tr[2]/td[3]/table/tbody/tr[1]/td/div/div[3]

I'm having trouble figuring out where to put that string in an XPath query. If an XPath query isn't the right tool, should I use preg_match instead?

I tried this:

$location = '/html/body/table/tbody/tr[2]/td[3]/table/tbody/tr[1]/td/div/div[3]';
$html = file_get_contents($URL);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document:
foreach ($xpath->query($location) as $node) {
  echo $node, "\n";
}

but nothing is printed to the page.

Thanks.

EDIT: Full source code here: http://pastebin.com/K5tZ4dFH

EDIT2: Cleaner code screenshot: https://i.stack.imgur.com/Y9mDg.png

Xander Luciano
  • Try echo `$node->item(0);` inside the loop. – Rikesh Dec 17 '13 at 05:43
  • Still nothing is output. The page isn't very cleanly coded; maybe I need to clean up the DOM? Other code I found seemed to use something to clean it up, but the method errored out and I couldn't find any documentation on it. @Rikesh – Xander Luciano Dec 17 '13 at 05:59
  • @hwnd it's above in the code, but the code is so messy that I tried to shorten it. Edited with a pastebin of the source – Xander Luciano Dec 17 '13 at 06:31
  • possible duplicate of [Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?](http://stackoverflow.com/questions/18241029/why-does-my-xpath-query-scraping-html-tables-only-work-in-firebug-but-not-the) – Jens Erat Dec 17 '13 at 08:53

2 Answers

From looking at your source, try the following:

$html = file_get_contents($URL);

$doc = new DOMDocument();
$doc->loadHTML($html); 

$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//div[contains(@style, 'left:20px')]");

foreach ($nodes as $node) {
   echo $node->textContent;
}
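
For reference, here is a self-contained sketch of the same idea that runs without fetching anything. The sample markup and its style values (including 'left:20px' on the inner div) are assumptions standing in for the real page, so adjust the fragment passed to contains() to whatever the target div actually carries. Two points are worth noting. First, a DOMElement cannot be echoed directly, which is why the loop reads textContent. Second, one common reason an XPath copied from Chrome matches nothing in PHP is that the browser adds tbody elements to its own DOM that may not exist in the raw HTML, so a relative query such as the one below tends to be more robust than an absolute path.

```php
<?php
// Hypothetical sample markup standing in for the real page; the inline
// style values here are assumptions made for illustration.
$html = <<<HTML
<html><body><table>
<tr>
<td style="width:100px">
<div style="position:relative">
<div style="left:20px">TEXT I WANT</div>
</div>
</td>
</tr>
</table></body></html>
HTML;

// Collect parser warnings instead of printing them; messy real-world HTML
// tends to trigger many of them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match on a fragment of the style attribute rather than on an absolute path.
$nodes = $xpath->query("//div[contains(@style, 'left:20px')]");

$titles = [];
foreach ($nodes as $node) {
    // A DOMElement cannot be echoed directly; read its textContent instead.
    $titles[] = trim($node->textContent);
}

echo implode("\n", $titles), "\n";
```

If the query returns several nodes on the real page, narrowing the contains() fragment or taking only the first entry of $titles may be enough.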
hwnd
  • Yeah, I just reduced the source code down because it was messy. I added a pastebin of the source code if you can make sense of it. – Xander Luciano Dec 17 '13 at 06:36
  • http://i.imgur.com/lWKheBy.png Maybe this image will help, it shows the code a little more organized. I'm trying to get the title of something off a site. – Xander Luciano Dec 17 '13 at 06:41

It looks like you want the text just before the first </div>, so this regex will find that:

[^<>]+(?=<\/div>)

Here's a live demo.
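
As a rough sketch of how that pattern could be used from PHP with preg_match: the HTML string below is a hypothetical fragment modelled on the question's snippet, and the variable names are made up for illustration. The pattern uses ~ as the delimiter, so the slash in </div> does not need escaping. Keep in mind that preg_match returns the first run of text that sits directly before any </div>, which on a full page may not be the title you are after.

```php
<?php
// Hypothetical fragment standing in for the fetched page source.
$html = '<td style="..."><div style="..."><div style="...">TEXT I WANT</div></div></td>';

// One or more characters that are neither "<" nor ">", followed by "</div>".
// The lookahead keeps the closing tag out of the match itself.
$pattern = '~[^<>]+(?=</div>)~';

if (preg_match($pattern, $html, $matches) === 1) {
    $title = trim($matches[0]);
    echo $title, "\n";
} else {
    $title = null;
    echo "No match\n";
}
```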

Bohemian