I'm having trouble scraping the URLs out of the Google results. This code worked for me for a long time, but it seems Google changed a few things this week, and now I'm getting a ton of extra characters surrounding the actual URL I want.

preg_match_all('@<h3\s*class="r">\s*<a[^<>]*href="([^<>]*)"[^<>]*>(.*)</a>\s*</h3>@siU', $results, $matches[$key]);

EDIT

All links come out like this when scraped with the above code:

/url?url=http://cooksandtravelbooks.com/write-for-us/&rct=j&sa=U&ei=XdayUNnHBIqDiwKZuYEY&ved=0CBQQFjAA&q=cooking+%5C%22Write+for+Us%5C%22&usg=AFQjCNGMiCiWYY_8JDAhqJggVDW2qHRMfw
– Dan
  • I suggest not using regex to parse HTML... http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454. You'd be better off using something like phpQuery or another DOM parser instead of regex. – Hugo Dozois Nov 25 '12 at 21:34
  • It works for me. Can you show us an example or two of the incorrect results, and the code you use to retrieve them? (And not in a comment, please. Edit the question and add this info to it.) – Alan Moore Nov 26 '12 at 00:19
  • You're not getting the "actual URLs" because when you click on a Google link you're actually going back to Google and then they redirect you to the real URL. – Andy Lester Nov 26 '12 at 03:11
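
As the last comment notes, each result link is a Google redirect, and the real target is carried in the url query parameter. Here is a minimal sketch (not from the original posts; the variable names are illustrative) of unwrapping such a link with PHP's standard parse_url() and parse_str():

<?php
// Example wrapped link taken from the question (query string truncated for brevity)
$wrapped = '/url?url=http://cooksandtravelbooks.com/write-for-us/&rct=j&sa=U';

// Pull out the query string and decode it into an array
parse_str(parse_url($wrapped, PHP_URL_QUERY), $params);

// The real destination sits in the "url" parameter
echo $params['url']; // http://cooksandtravelbooks.com/write-for-us/
?>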

1 Answer

<?php
// Fetch the page with cURL
$url = "http://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);

// Parse the HTML; @ suppresses warnings from invalid markup
$dom = new DOMDocument();
@$dom->loadHTML($data);

// Walk every anchor tag and print its href
foreach($dom->getElementsByTagName('a') as $link) {
    echo $link->getAttribute('href');
    echo "<br />";
}
?>
– JP_
  • Thanks for posting that. I've added it to http://htmlparsing.com/php.html, which I post in response to most of these PHP HTML parsing questions. – Andy Lester Nov 25 '12 at 21:51
  • Might want to also note that the @ error-suppression operator is necessary to hide warnings caused by invalid HTML markup. – JP_ Nov 25 '12 at 22:33
  • Note added. Thanks. The source for the site is on GitHub if anyone wants to contribute: https://github.com/petdance/htmlparsing/ – Andy Lester Nov 25 '12 at 22:35
  • Unfortunately I'm still not having any better luck getting the actual URLs with this method so far. – Dan Nov 25 '12 at 23:34
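
Dan's last comment points at the two gaps left in the answer: it walks every anchor on the page rather than just the result links, and the hrefs it prints are still Google's redirect wrappers. A hedged sketch of closing both gaps, assuming the results still use the <h3 class="r"> markup that the question's regex targets (libxml_use_internal_errors() is a standard alternative to the @ operator):

<?php
// Assumes $data holds the fetched HTML, as in the answer above
libxml_use_internal_errors(true); // collect markup warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($data);

// Target only the result links (assumption: <h3 class="r"> still wraps them)
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//h3[@class="r"]/a') as $link) {
    $href = $link->getAttribute('href');

    // Unwrap the /url?url=... redirect if present
    parse_str((string) parse_url($href, PHP_URL_QUERY), $params);
    echo isset($params['url']) ? $params['url'] : $href;
    echo "\n";
}
?>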