Fetch HTML content from a website

Question

Possible Duplicate:
How to parse and process HTML with PHP?

I have used this code to fetch the HTML content of a page at a given URL.

**Code:**

Example URL: http://www.qatarsale.com/EnMain.aspx

$regexp = '/<div id="UpdatePanel4">(.*?)<\/div>/i';

@preg_match_all($regexp, @file_get_contents('http://www.qatarsale.com/EnMain.aspx'), $matches, PREG_SET_ORDER);

But $matches comes back as an empty array. I want to fetch all of the HTML content found inside the div with id="UpdatePanel4".

If anybody has a solution, please suggest it.

Thanks

xdazz · Answer 1 · 2012-06-28T07:44:54.067

3

First, make sure the server lets you fetch the data.

Second, use an HTML parser to parse the data instead of a regular expression.

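// Fetch the page; @ hides the warning so the failure is handled below.
$html = @file_get_contents('http://www.qatarsale.com/EnMain.aspx');
if (!$html) {
  die('Cannot get the content!');
}
// Real-world HTML is rarely valid; collect parse warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
// DOMElement for the div, or null if no element has that id.
$content = $doc->getElementById('UpdatePanel4');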
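
The snippet above ends with a DOMElement, while the question asks for the HTML inside the div. A minimal sketch of that last step follows. The helper name `innerHtmlById` and the sample markup are assumptions made for illustration; they are not part of the DOM extension or of the target page.

```php
<?php
// Hypothetical helper (not part of the DOM API): returns the markup inside
// the element with the given id, or null when no such element exists.
function innerHtmlById(string $html, string $id): ?string
{
    // Collect parse warnings internally instead of printing them.
    libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $element = $doc->getElementById($id);
    if ($element === null) {
        return null;
    }

    // saveHTML($node) serializes a single node; joining the children gives
    // the element's contents without its own opening and closing tags.
    $inner = '';
    foreach ($element->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Sample markup standing in for the downloaded page (an assumption).
$sample = '<html><body>'
    . '<div id="UpdatePanel4"><p>Hello</p><div class="inner">World</div></div>'
    . '<div id="other">ignored</div>'
    . '</body></html>';

echo innerHtmlById($sample, 'UpdatePanel4'), PHP_EOL;
```

The serializer may insert line breaks between block-level elements, so the output should be treated as equivalent markup rather than a byte-for-byte copy of the source.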

edited Jun 28 '12 at 07:44

answered Jun 28 '12 at 07:38

xdazz


2

Oh it feels so good to see someone not suggesting string manipulation or regex. – Adi Jun 28 '12 at 07:39
@AdnanShammout: Looks bad to see a 20k rep guy not linking to the duplicate. – Leigh Jun 28 '12 at 08:38

score 0 · Answer 2 · answered Jun 28 '12 at 07:35

// Get the webpage
$html = @file_get_contents('http://www.qatarsale.com/EnMain.aspx');

$startingTag = '<div id="UpdatePanel4">';
// Find the position of the opening '<div id="UpdatePanel4">' tag
$startPos = strpos($html, $startingTag);
// The content begins right after the opening tag
$contentStart = $startPos + strlen($startingTag);
// Find the position of the first closing </div> after the opening tag
$endPos = strpos($html, '</div>', $contentStart);
// substr() takes a length, not an end position
$contents = substr($html, $contentStart, $endPos - $contentStart);

You will have to do a bit more work if the UpdatePanel4 div contains nested divs, because this stops at the first closing </div>.
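
One way the extra work for nested divs could look is a depth counter: every later `<div ...>` raises the depth and every `</div>` lowers it, until the depth returns to zero. This is only a sketch. The function name `extractDivById` is hypothetical, it assumes the id attribute is double-quoted, and it will still be confused by `<div` text inside comments or scripts, which is why a real parser remains the safer choice.

```php
<?php
// Hypothetical helper: returns the inner HTML of the first <div> carrying
// the given id, counting nested <div> tags to find the matching </div>.
function extractDivById(string $html, string $id): ?string
{
    // Locate the opening tag, allowing other attributes around the id.
    $open = '/<div\b[^>]*\sid="' . preg_quote($id, '/') . '"[^>]*>/i';
    if (!preg_match($open, $html, $m, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    $contentStart = $m[0][1] + strlen($m[0][0]);

    // Walk every later <div ...> or </div>, tracking the nesting depth.
    $depth = 1;
    $offset = $contentStart;
    while (preg_match('/<(\/?)div\b[^>]*>/i', $html, $tag, PREG_OFFSET_CAPTURE, $offset)) {
        $depth += ($tag[1][0] === '/') ? -1 : 1;
        if ($depth === 0) {
            // Matching close found: return everything before it.
            return substr($html, $contentStart, $tag[0][1] - $contentStart);
        }
        $offset = $tag[0][1] + strlen($tag[0][0]);
    }
    return null; // no matching </div> was found
}

// Sample markup (an assumption) with a nested div inside the target div.
$sample = '<div id="UpdatePanel4"><div class="a">one</div><p>two</p></div>'
    . '<div id="x">three</div>';

echo extractDivById($sample, 'UpdatePanel4'), PHP_EOL;
```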

score 0 · Answer 3 · answered Jun 28 '12 at 07:38

That just won't help. Even if you manage to get the regexp working, there are two issues with the way you are using it:

What if the server makes a minor change to the HTML, such as <div data-blah="blah" id="UpdatePanel4">? In that case you have to change your regexp too.
Second issue: I think you want the innerHTML of the div, right? In that case, the regexp approach takes no account of nesting or the tree structure. The string you get runs from the pattern you specify up to the first </div> that is encountered.

Solution:

It is ALWAYS a bad idea to use regexps to parse HTML. Use a DOMDocument instead.
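
Both issues can be reproduced on small strings. The sketch below runs the regexp from the question against two made-up inputs and then reads the same inputs with DOMDocument. The sample markup and the helper name `panelText` are assumptions for illustration only.

```php
<?php
// The pattern from the question.
$regexp = '/<div id="UpdatePanel4">(.*?)<\/div>/i';

// Issue 1: an extra attribute before the id means the pattern finds nothing.
$withAttr = '<div data-blah="blah" id="UpdatePanel4">text</div>';
$attrHits = preg_match_all($regexp, $withAttr, $m1, PREG_SET_ORDER);

// Issue 2: with a nested div, the lazy match stops at the first </div>.
$nested = '<div id="UpdatePanel4"><div>inner</div> tail</div>';
preg_match_all($regexp, $nested, $m2, PREG_SET_ORDER);
$regexCapture = $m2[0][1];

// Hypothetical helper: text content of the UpdatePanel4 element via the DOM.
function panelText(string $html): ?string
{
    libxml_use_internal_errors(true); // keep parse warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    $node = $doc->getElementById('UpdatePanel4');
    return $node === null ? null : $node->textContent;
}

var_dump($attrHits);              // number of regexp matches for issue 1
var_dump($regexCapture);          // what the regexp captured for issue 2
var_dump(panelText($withAttr));   // DOM lookup ignores the extra attribute
var_dump(panelText($nested));     // DOM lookup respects the nesting
```

A further pitfall of the original pattern is that `.` does not match newlines unless the `s` modifier is set, so a div whose contents span several lines would not match either.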
