
Possible Duplicate:
Screen scraping in PHP using file_get_contents

Can anyone help me? I am trying to scrape hotel reviews from LateRooms.com. Don't tell me it's a bad idea, because I already have permission as an affiliate.

My code:

<?php
header('Content-Type: text/plain');

$contents = file_get_contents('http://www.laterooms.com/en/hotel-reviews/238902_the-westfield-bb-sandown.aspx');
$contents = preg_replace('/\s{1,}/', ' ', $contents); // collapse runs of whitespace

print $contents . "\n";

$records = preg_split('/<div id="review/', $contents);

for ($ix = 1; $ix < count($records); $ix++) {
    $tmp = $records[$ix];

    preg_match('/id="review"/', $tmp, $match_reviews);
    print_r($match_reviews);

    exit();
}
?>

This works really well; the only problem is that it pulls in the whole page of code and doesn't match the div id 'review'.

Thanks in advance.

Community

1 Answer

// Fetch a URL with cURL, following redirects.
function file_get_contents_curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}

// Return the inner HTML of a DOM element by serialising each child node.
function DOMinnerHTML($element) {
    $innerHTML = "";
    $children = $element->childNodes;
    foreach ($children as $child) {
        $tmp_dom = new DOMDocument();
        $tmp_dom->appendChild($tmp_dom->importNode($child, true));
        $innerHTML .= trim($tmp_dom->saveHTML());
    }
    return $innerHTML;
}

$url  = 'http://www.laterooms.com/en/hotel-reviews/238902_the-westfield-bb-sandown.aspx';
$html = file_get_contents_curl($url);

// Parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings from malformed HTML

$reviews = array(); // initialise so print_r() works even when nothing matches
$div_elements = $doc->getElementsByTagName('div');

if ($div_elements->length != 0) {
    foreach ($div_elements as $div_element) {
        if ($div_element->getAttribute('class') == 'review newReview') {
            $reviews[] = DOMinnerHTML($div_element);
        }
    }
}

print_r($reviews);

Try this; it will return all the reviews. You can then refine the content as per your requirements.
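As a variant (my own sketch, not part of the original answer), you can let DOMXPath select the matching divs directly instead of looping over every div on the page. The class name 'review newReview' is taken from the code above; adjust it if the site's markup differs.

```php
<?php
// Sketch: pull out review divs with an XPath query rather than
// scanning every <div>. Assumes the class is exactly "review newReview".
function get_reviews_xpath($html) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from malformed HTML
    $xpath = new DOMXPath($doc);

    $reviews = array();
    // Matches divs whose class attribute is exactly "review newReview".
    foreach ($xpath->query('//div[@class="review newReview"]') as $div) {
        $reviews[] = trim($doc->saveHTML($div));
    }
    return $reviews;
}

// Usage, reusing file_get_contents_curl() from the answer:
// $html    = file_get_contents_curl($url);
// $reviews = get_reviews_xpath($html);
// print_r($reviews);
?>
```

This avoids the helper loop over childNodes entirely, at the cost of an exact-match on the class attribute (XPath 1.0 has no class-contains shorthand).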

Abhishek