How do I grab part of a page's HTML DOM with PHP?

Question

I'm grabbing data from a published Google spreadsheet, and all I want is the information inside the content div (<div id="content">...</div>).

I know that the content starts with <div id="content"> and ends with </div><div id="footer">.

What's the best / most efficient way to grab the part of the DOM that is inside there? I was thinking of a regular expression (see my example below), but it is not working and I'm not sure it is that efficient...

header('Content-type: text/plain');

$foo = file_get_contents('https://docs.google.com/spreadsheet/pub?key=0Ahuij-1M3dgvdG8waTB0UWJDT3NsUEdqNVJTWXJNaFE&single=true&gid=0&output=html&ndplr=1');

$start = '<div id="content">';
$end = '<div id="footer">';

$foo = preg_replace("#$start(.*?)$end#",'$1',$foo);

echo $foo;

UPDATE

I guess my other question is whether it is simpler and easier to use a regex with start and end points than to parse a DOM that might contain errors and then extract the piece I need. Regex seems like the way to go, but I would love to hear your opinions.

Also this might help: http://stackoverflow.com/q/3577641/642173 — Melsi, Oct 19 '11 at 05:19

Michael Low · Accepted Answer · 2014-01-17T09:00:50.860

1

Try changing your regex to $foo = preg_replace("#$start(.*?)$end#s",'$1',$foo); The s modifier makes . match newlines as well. As it stands, your regex only matches if all the content between the tags is on a single line.

If your HTML page is any more complex than that, then regex probably won't cut it, and you'd need to look into a parser like DOMDocument or Simple HTML DOM.
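A minimal sketch of the same idea using preg_match, which returns only the captured part (preg_replace with '$1' would leave the rest of the page around it). The $html sample is made up for illustration and stands in for the fetched page; preg_quote is added on the assumption that the markers should be matched literally.

```php
<?php
// Sample markup standing in for the fetched page (an assumption for illustration).
$html = "<div id=\"header\">Title</div>\n"
      . "<div id=\"content\">\n<table><tr><td>42</td></tr></table>\n</div>"
      . "<div id=\"footer\">Footer</div>";

$start = '<div id="content">';
$end   = '</div><div id="footer">';

// preg_quote() escapes any regex metacharacters in the markers.
// The s modifier lets . match newlines, so multi-line content is captured.
$pattern = '#' . preg_quote($start, '#') . '(.*?)' . preg_quote($end, '#') . '#s';

$content = null;
if (preg_match($pattern, $html, $matches)) {
    // $matches[1] holds only the text between the two markers.
    $content = trim($matches[1]);
}

echo $content, "\n";
```

If the markers are not found, $content stays null, so the caller can tell a failed match apart from an empty div.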

edited Jan 17 '14 at 09:00

answered Oct 19 '11 at 05:21

Michael Low


score 0 · Answer 2 · answered Oct 19 '11 at 05:16

0

If you have a lot of this to do, I would recommend taking a look at http://simplehtmldom.sourceforge.net. It is really good for this sort of thing.

answered Oct 19 '11 at 05:16

Last Rose Studios


score 0 · Answer 3 · answered Oct 19 '11 at 05:31

0

Do not use regex; it can fail. Use PHP's built-in DOM parser: http://php.net/manual/en/class.domdocument.php

You can easily traverse the document and extract the relevant content.
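A rough sketch of how that might look, assuming the page has a single div with id="content". The $html sample is invented for illustration; in practice it would be the string returned by file_get_contents. libxml_use_internal_errors(true) keeps warnings about malformed markup from being printed while the parser recovers from them.

```php
<?php
// Sample markup standing in for the fetched page (an assumption for illustration).
$html = '<html><body><div id="header">Title</div>'
      . '<div id="content"><p>Row 1</p><p>Row 2</p></div>'
      . '<div id="footer">Footer</div></body></html>';

// Collect parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Locate the content div by its id attribute.
$xpath = new DOMXPath($doc);
$node  = $xpath->query('//div[@id="content"]')->item(0);

// Serialize each child so the result is the inner HTML, without the wrapping div.
$inner = '';
if ($node !== null) {
    foreach ($node->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
}

echo $inner, "\n";
```

Here $inner stays an empty string when no matching div exists, which is easier to check for than a regex that silently fails to match.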

answered Oct 19 '11 at 05:31

DhruvPathak


Seems like the regex is less likely to fail than trying to load and parse a DOM that may have errors... – cwd Oct 19 '11 at 05:33
