I now understand that using regular expressions to parse HTML is not advised. I'm currently using the following regex to get the data inside a <td> element that comes directly after a <th> element.

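$string = "</th><td>Capture This</td>";
// Lazy (.*?) stops at the first </td>; \s* allows only whitespace between the tags.
$pattern = "/<\/th>\s*<td>(.*?)<\/td>/";

// $matches[0] is the whole match; $matches[1] is the captured cell text.
if (preg_match($pattern, $string, $matches)) {
    echo("<pre>" . $matches[1] . "</pre>");
}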

Can somebody please explain to me how I'd go about capturing the contents of a <td> element that comes directly after the closing tag of a <th> element using PHP's DOMDocument or similar functionality?

Felix Kling
hhwhy

1 Answer

It can easily be fetched with Simple HTML DOM for PHP:

http://simplehtmldom.sourceforge.net/

Post some more of the source and I will give you the element path.
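For comparison, here is a minimal sketch of the same lookup using PHP's bundled DOMDocument and DOMXPath rather than a third-party library. The helper name firstTdAfterTh() and the sample table are assumptions made for illustration; the real page markup is not shown here, so the XPath expression may need adjusting.

```php
<?php
// Sketch only: firstTdAfterTh() is a hypothetical helper, not part of any library.
function firstTdAfterTh(string $html): ?string
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // following-sibling::*[1] is the element right after a <th>;
    // [self::td] keeps it only if that element is a <td>.
    $nodes = $xpath->query('//th/following-sibling::*[1][self::td]');

    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
}

// Assumed sample markup: a complete row, unlike the bare "</th><td>..." fragment.
$html = '<table><tr><th>Label</th><td>Capture This</td></tr></table>';
echo firstTdAfterTh($html); // prints "Capture This"
```

The function returns the first match in document order, or null when no <th> is immediately followed by a <td>.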

abcde123483
  • Ah! That library looks great, but yeah if you can help me that would be awesome. Here is a pastebin: http://pastebin.com/rGNBbVAK – hhwhy Nov 22 '11 at 17:53
  • As I said, you did not post enough of the HTML source for me to give a reliable XPath – abcde123483 Nov 22 '11 at 17:55
  • Sorry about that, here is more code. The text inside the preceding <th> element is ALWAYS the same and completely unique. So basically, I need to get the contents of a <td> element that follows a <th> element whose text is, for example, "Unique". http://pastebin.com/rGNBbVAK – hhwhy Nov 22 '11 at 17:59
  • Suggested third party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org). – Gordon Nov 22 '11 at 18:12
  • @bow-viper1 your example still presents us with precious little information compared with the full source of the page – abcde123483 Nov 22 '11 at 18:26
  • @ulvund I've just pastebin'd the full form. There is no identifiable information outside of that, unfortunately :( http://pastebin.com/Vx8kGU7V – hhwhy Nov 22 '11 at 22:46
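The requirement described in the comments (take the <td> that follows a <th> whose text is fixed and unique) can be sketched with plain DOM traversal. This is an illustration under assumptions: the helper name tdAfterLabelledTh() and the sample table are made up, and the pastebin markup is not reproduced here. Comparing the label in PHP rather than inside an XPath string sidesteps quoting problems if the label contains quote characters.

```php
<?php
// Sketch only: tdAfterLabelledTh() and the sample markup are hypothetical.
function tdAfterLabelledTh(string $html, string $label): ?string
{
    $doc = new DOMDocument();

    // Suppress warnings about imperfect markup while parsing.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('th') as $th) {
        if (trim($th->textContent) !== $label) {
            continue; // not the header we are looking for
        }

        // Skip whitespace text nodes and comments between the two cells.
        $next = $th->nextSibling;
        while ($next !== null && $next->nodeType !== XML_ELEMENT_NODE) {
            $next = $next->nextSibling;
        }

        // Accept the neighbour only if it really is a <td>.
        if ($next !== null && $next->nodeName === 'td') {
            return trim($next->textContent);
        }
    }

    return null; // no matching <th>, or it is not followed by a <td>
}

$html = '<table>
  <tr><th>Other</th><td>Skip This</td></tr>
  <tr><th>Unique</th><td>Capture This</td></tr>
</table>';

echo tdAfterLabelledTh($html, 'Unique'); // prints "Capture This"
```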