I'm having trouble scraping the URLs out of the Google results. This code worked for me for a long time, but it seems Google changed a few things this week, and now I'm getting a ton of extra characters surrounding the actual URL I want.

preg_match_all('@<h3\s*class="r">\s*<a[^<>]*href="([^<>]*)"[^<>]*>(.*)</a>\s*</h3>@siU', $results, $matches[$key]);

EDIT

All links come out like this when scraped with the above code:

/url?url=http://cooksandtravelbooks.com/write-for-us/&rct=j&sa=U&ei=XdayUNnHBIqDiwKZuYEY&ved=0CBQQFjAA&q=cooking+%5C%22Write+for+Us%5C%22&usg=AFQjCNGMiCiWYY_8JDAhqJggVDW2qHRMfw
– Dan
  • I suggest not using regex to parse HTML... http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454. You'd be better off using something like phpQuery or another DOM parser instead of regex. – Hugo Dozois Nov 25 '12 at 21:34
  • It works for me. Can you show us an example or two of the incorrect results, and the code you use to retrieve them? (And not in a comment, please. Edit the question and add this info to it.) – Alan Moore Nov 26 '12 at 00:19
  • You're not getting the "actual URLs" because when you click on a Google link you're actually going back to Google and then they redirect you to the real URL. – Andy Lester Nov 26 '12 at 03:11
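
As the last comment notes, each result link is a Google redirect, and the real target is carried in the url query parameter. Here is a minimal sketch (not from the original posts; the variable names are illustrative) of unwrapping such a link with PHP's standard parse_url() and parse_str():

<?php
// Example wrapped link taken from the question (query string truncated for brevity)
$wrapped = '/url?url=http://cooksandtravelbooks.com/write-for-us/&rct=j&sa=U';

// Pull out the query string and decode it into an array
parse_str(parse_url($wrapped, PHP_URL_QUERY), $params);

// The real destination sits in the "url" parameter
echo $params['url']; // http://cooksandtravelbooks.com/write-for-us/
?>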

1 Answer

<?php
// Fetch the page with cURL
$url = "http://www.google.com";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);

// Parse the HTML; @ suppresses warnings from invalid markup
$dom = new DOMDocument();
@$dom->loadHTML($data);

// Walk every anchor tag and print its href
foreach($dom->getElementsByTagName('a') as $link) {
    echo $link->getAttribute('href');
    echo "<br />";
}
?>
– JP_
  • Thanks for posting that. I've added it to http://htmlparsing.com/php.html, which I post in response to most of these PHP HTML parsing questions. – Andy Lester Nov 25 '12 at 21:51
  • Might want to also note that the @ error-suppression operator is necessary to hide warnings caused by invalid HTML markup. – JP_ Nov 25 '12 at 22:33
  • Note added. Thanks. The source for the site is on GitHub if anyone wants to contribute: https://github.com/petdance/htmlparsing/ – Andy Lester Nov 25 '12 at 22:35
  • Unfortunately I'm still not having any better luck getting the actual URLs with this method so far. – Dan Nov 25 '12 at 23:34
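
Dan's last comment points at the two gaps left in the answer: it walks every anchor on the page rather than just the result links, and the hrefs it prints are still Google's redirect wrappers. A hedged sketch of closing both gaps, assuming the results still use the <h3 class="r"> markup that the question's regex targets (libxml_use_internal_errors() is a standard alternative to the @ operator):

<?php
// Assumes $data holds the fetched HTML, as in the answer above
libxml_use_internal_errors(true); // collect markup warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($data);

// Target only the result links (assumption: <h3 class="r"> still wraps them)
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//h3[@class="r"]/a') as $link) {
    $href = $link->getAttribute('href');

    // Unwrap the /url?url=... redirect if present
    parse_str((string) parse_url($href, PHP_URL_QUERY), $params);
    echo isset($params['url']) ? $params['url'] : $href;
    echo "\n";
}
?>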