47

I would like to scrape the content of this Google search result page using cURL. I've tried setting different user agents and tweaking other options, but I just can't seem to get the content of that page: I often get redirected, or I get a "page moved" error.

I believe it has something to do with the fact that the query string gets encoded somewhere but I'm really not sure how to get around that.

    //$url is the same as the link above
    $ch = curl_init();
    $user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
    curl_setopt($ch, CURLOPT_TIMEOUT, 120);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
    echo curl_exec ($ch);

What do I need to do to get my PHP code to show the exact content of the page as I would see it in my browser? What am I missing? Can anyone point me in the right direction?

I've seen similar questions on SO, but none with an answer that could help me.

EDIT:

I tried just opening the link using the Selenium WebDriver, which gives the same results as cURL. I still think this has to do with the special characters in the query string getting mangled somewhere in the process.
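Since the suspicion is that special characters in the query string are getting mangled, one offline sanity check (a sketch, not Google-specific) is to percent-encode the query value yourself before building the URL. PHP's `rawurlencode()` does exactly this; the same logic written as a small POSIX shell helper:

```shell
# Percent-encode a string for safe use in a URL query - the same job
# PHP's rawurlencode() does. Unreserved characters pass through
# unchanged; everything else becomes %XX.
urlencode() {
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # string minus its first character
    c=${s%"$rest"}       # the first character itself
    case $c in
      [a-zA-Z0-9.~_-]) out="$out$c" ;;                    # unreserved: keep
      *) out="$out$(printf '%%%02X' "'$c")" ;;            # reserved: %XX
    esac
    s=$rest
  done
  printf '%s\n' "$out"
}

urlencode 'flower &tbm=isch'   # -> flower%20%26tbm%3Disch
```

If the encoded string your script sends differs from what the browser sends (compare against the browser's network tab), that difference is the first place to look.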

7usam
  • `$output = curl_exec($ch); echo $output;` – Bojan Kovacevic Feb 19 '13 at 09:22
  • @BojanKovacevic I've edited the code to show that I have been doing `echo curl_exec($ch);` I am getting a page returned but not the one I am requesting. – 7usam Feb 19 '13 at 10:00
  • You can't scrape Google search results - Google's results are their primary IP, and they're not going to give them away! Regardless of what you do to your code, you'll face many (MANY!) other issues, not least of which is a blacklisted IP. If you're trying to monitor search results, SEO, or similar, use proper tracking software such as http://www.seomoz.org/ – LuckySpoon Feb 19 '13 at 11:41
  • @LuckySpoon if I cannot scrape that page, I would like to know why (in terms of technical restrictions). I do not care about getting blacklisted yet, at the moment I just want to scrape this one page. I am not monitoring search results, the tracking software you mention does not suit my need. – 7usam Feb 19 '13 at 11:51
  • Sure - Google impose their restrictions for their own reasons (such as IP protection, as I mentioned earlier). They don't offer you any way to officially scrape their results (note the lack of a Search API on the Products page https://developers.google.com/products/). As far as I'm aware, it's simply not an option. You might have luck on a Google Group for developers or similar? – LuckySpoon Feb 19 '13 at 11:54
  • @LuckySpoon Ok, if that is the case, how is my browser able to get a response, but not my php page? That's what I am struggling to get my head around. I am using the same request headers as far as I can tell. I realise they do not have an API for my purpose which is why I had resorted to scraping. – 7usam Feb 19 '13 at 12:01
  • I think we can all agree Google is much smarter than all of us. Not sure how or why they block it, but they do. If you can get around it, it probably won't be for long. – LuckySpoon Feb 19 '13 at 23:09

5 Answers

73

This is how:

    /**
     * Get a web file (HTML, XHTML, XML, image, etc.) from a URL.  Return an
     * array containing the HTTP server response header fields and content.
     */
    function get_web_page( $url )
    {
        $user_agent='Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';

        $options = array(

            CURLOPT_CUSTOMREQUEST  =>"GET",        //set request type post or get
            CURLOPT_POST           =>false,        //set to GET
            CURLOPT_USERAGENT      => $user_agent, //set user agent
            CURLOPT_COOKIEFILE     =>"cookie.txt", //set cookie file
            CURLOPT_COOKIEJAR      =>"cookie.txt", //set cookie jar
            CURLOPT_RETURNTRANSFER => true,     // return web page
            CURLOPT_HEADER         => false,    // don't return headers
            CURLOPT_FOLLOWLOCATION => true,     // follow redirects
            CURLOPT_ENCODING       => "",       // handle all encodings
            CURLOPT_AUTOREFERER    => true,     // set referer on redirect
            CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
            CURLOPT_TIMEOUT        => 120,      // timeout on response
            CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
        );

        $ch      = curl_init( $url );
        curl_setopt_array( $ch, $options );
        $content = curl_exec( $ch );
        $err     = curl_errno( $ch );
        $errmsg  = curl_error( $ch );
        $header  = curl_getinfo( $ch );
        curl_close( $ch );

        $header['errno']   = $err;
        $header['errmsg']  = $errmsg;
        $header['content'] = $content;
        return $header;
    }

Example

    // Read a web page and check for errors:

    $result = get_web_page( $url );

    if ( $result['errno'] != 0 )
        ... error: bad url, timeout, redirect loop ...

    if ( $result['http_code'] != 200 )
        ... error: no page, no permissions, no service ...

    $page = $result['content'];
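The `errno`/`http_code` checks above have a command-line analogue: curl's exit status plays the role of `$result['errno']`. A sketch using a `file://` URL so it runs without network access (the error-checking pattern, not the URL, is the point):

```shell
# curl's exit status is the equivalent of curl_errno(): 0 on success,
# non-zero for DNS failures, timeouts, redirect loops, and so on.
tmp=$(mktemp)
printf '<html>ok</html>' > "$tmp"
content=$(curl -s "file://$tmp")
err=$?
echo "errno=$err"
echo "content=$content"
rm -f "$tmp"
```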
  • Nope, I get redirected to the main Google search page (not the search results page in my URL). Same as what I had. – 7usam Feb 19 '13 at 09:56
  • @7usam I fixed my answer to use GET only! Try now! – Feb 19 '13 at 11:13
  • cURL uses GET by default, unless you specify `CURLOPT_POST` or `CURLOPT_POSTFIELDS`. Tried your code anyway, no change. – 7usam Feb 19 '13 at 11:16
16

For a realistic approach that emulates human behavior as closely as possible, you may want to add a referer to your cURL options (`CURLOPT_REFERER`). You may also want to enable `CURLOPT_FOLLOWLOCATION`. And trust me, whoever said that cURLing Google results is impossible is simply wrong: everything you can do "IRL" with your own browser can be emulated using PHP cURL or libcurl in Python. You just need to do more cURLs to get buff. Then you will see what I mean. :)

    $url = "http://www.google.com/search?q=" . urlencode($strSearch) . "&hl=en&start=0&sa=N";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_REFERER, 'http://www.example.com/1');
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_VERBOSE, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible;)");
    curl_setopt($ch, CURLOPT_URL, $url);
    $response = curl_exec($ch);
    curl_close($ch);
Robert Sinclair
712011m4n
    With the `urlencode()` around the whole `$url`, you end up escaping the "://" etc., which cURL doesn't like. To get this to work, just `urlencode($strSearch)` in the `$url`, and remove `urlencode()` from the `CURLOPT_URL` line. – Paul Calcraft Oct 14 '15 at 11:25
  • Since you're talking about adding a referer, maybe you should have added it in your code snippet? `curl_setopt($ch, CURLOPT_REFERER, 'http://www.example.com/1');` – Robert Sinclair Nov 15 '19 at 17:29
5

Try this:

    $url = "http://www.google.com/search?q=" . urlencode($strSearch) . "&hl=en&start=0&sa=N";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_VERBOSE, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible;)");
    curl_setopt($ch, CURLOPT_URL, $url);
    $response = curl_exec($ch);
    curl_close($ch);
One Man Crew
  • My request URL is a bit more complicated than yours. The code works for a simple query like you've provided, but not for mine. See link in the question. – 7usam Feb 19 '13 at 11:37
  • @7usam The problem is with the link - how are you building it? What are you trying to find? – One Man Crew Feb 19 '13 at 11:40
  • This is my URL: https://www.google.com/search?hl=en&tbo=d&tbs=simg%3aCAESYxphCxCo1NgEGgQIBQgIDAsQsIynCBo4CjYIARIQ-QSMBeUEigSFBYwEiQWABRog8pwYCTxktmeGRsfQir52lJaebNtk-HopuZePSqpeh0gMCxCOrv4IGgoKCAgBEgSh_1cVaDA&q=flower%20&tbm=isch&sa=X&ei=7TsjUZnWNu3smAWDqIHQAg&ved=0CDsQsw4 encoding screws things up more – 7usam Feb 19 '13 at 11:54
  • I think the VERBOSE helped me! My client had changed something on their server and wasn't giving me a response. Thank you! :) – Andy Mar 21 '17 at 21:01
1

I suppose you have noticed that your link is actually an HTTPS link... It seems that your cURL options do not include any kind of SSL handling (for example `CURLOPT_SSL_VERIFYPEER` or `CURLOPT_CAINFO`); maybe this could be your problem. Why don't you try a non-HTTPS link to see what happens (i.e. a Google Custom Search Engine)?
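If SSL is the suspect, a first check is whether the local curl build supports HTTPS at all (a sketch, assuming the `curl` CLI is installed; the PHP-side certificate knobs are `CURLOPT_SSL_VERIFYPEER` and `CURLOPT_CAINFO`):

```shell
# "https" must appear in curl's supported-protocols list; a build
# without TLS support fails on every https:// request outright.
curl --version | head -n 1
curl --version | tr ' ' '\n' | grep -x 'https' | head -n 1
```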

George Vasiliou
  • Welcome to Stack Overflow. You need to learn [how to write a good answer](http://stackoverflow.com/help/how-to-answer). You should visit the [help center](http://stackoverflow.com/help) first. Although there is nothing wrong with it, you are answering a question that was asked nearly 2 years ago. – afzalex Aug 28 '14 at 23:29
1

Get content with cURL in PHP

The server must have the cURL extension enabled (e.g. in php.ini, or in Apache's httpd.conf).

    function UrlOpener($url)
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $output = curl_exec($ch);
        curl_close($ch);
        echo $output;
    }

If you want to get the content via the Google cache with cURL, you can use this URL: http://webcache.googleusercontent.com/search?q=cache: followed by your URL. Sample: http://urlopener.mixaz.net/