59

There are websites that when I open specific ajax request on browser, I get the resulted page. But when I try to load them with curl, I receive an error from the server.

How can I properly emulate a get request to the server that will simulate a browser?

That is what I am doing:

$url="https://new.aol.com/productsweb/subflows/ScreenNameFlow/AjaxSNAction.do?s=username&f=firstname&l=lastname";
ini_set('user_agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT
5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)');
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
print $result;
peterh
  • 11,875
  • 18
  • 85
  • 108
ufk
  • 30,912
  • 70
  • 235
  • 386

2 Answers2

86

Are you sure the curl module honors ini_set('user_agent',...)? There is an option CURLOPT_USERAGENT described at http://docs.php.net/function.curl-setopt.
Could there also be a cookie tested by the server? That you can handle by using CURLOPT_COOKIE, CURLOPT_COOKIEFILE and/or CURLOPT_COOKIEJAR.

edit: Since the request uses https there might also be error in verifying the certificate, see CURLOPT_SSL_VERIFYPEER.

$url="https://new.aol.com/productsweb/subflows/ScreenNameFlow/AjaxSNAction.do?s=username&f=firstname&l=lastname";
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';

$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
var_dump($result);
VolkerK
  • 95,432
  • 20
  • 163
  • 226
  • 5
    Awesome. I was stuck on the same. SSL_VERIFYPEER=false did the trick. Thanks! – beetree Oct 20 '11 at 15:26
  • what do i have to change if i want to simulate latest firefox? – Sisir Jul 17 '12 at 14:37
  • 1
    Sisir: see http://stackoverflow.com/questions/1457380/is-there-an-online-user-agent-database – VolkerK Jul 24 '12 at 07:21
  • Hey, how about if I want to access from 'front' page then to the 'target' page (same domain), I dont know why but when I access the 'target' page directly it will response : 'TRY TO MAKE ROBOT?'. But when I access the 'front' first (with browser), the response is normal. – Anggie Aziz Dec 11 '13 at 07:03
  • 1
    Did u save cookies to use on the second url step? `curl_setopt($ch, CURLOPT_COOKIEJAR, 'file or path');` to set on 1st step and `curl_setopt($ch, CURLOPT_COOKIEFILE, 'file or path');` to read on 2nd step. Maybe you need use referer too, like `curl_setopt($ch, CURLOPT_REFERER, true);` in that case, you can use domain (or IP). – m3nda May 01 '14 at 08:38
  • Setting the cookie jar helped me with my problem. – GTCrais Dec 27 '18 at 23:51
  • @VolkerK You solution is too good. I was found since 4 hours then I hot it here. – Bhavin Thummar Jan 22 '21 at 12:01
15

i'll make an example, first decide what browser you want to emulate, in this case i chose Firefox 60.6.1esr (64-bit), and check what GET request it issues, this can be obtained with a simple netcat server (MacOS bundles netcat, most linux distributions bunles netcat, and Windows users can get netcat from.. Cygwin.org , among other places),

setting up the netcat server to listen on port 9999: nc -l 9999

now hitting http://127.0.0.1:9999 in firefox, i get:

$ nc -l 9999
GET / HTTP/1.1
Host: 127.0.0.1:9999
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1

now let us compare that with this simple script:

<?php
$ch=curl_init("http://127.0.0.1:9999");
curl_exec($ch);

i get:

$ nc -l 9999
GET / HTTP/1.1
Host: 127.0.0.1:9999
Accept: */*

there are several missing headers here, they can all be added with the CURLOPT_HTTPHEADER option of curl_setopt, but the User-Agent specifically should be set with CURLOPT_USERAGENT instead (it will be persistent across multiple calls to curl_exec() and if you use CURLOPT_FOLLOWLOCATION then it will persist across http redirections as well), and the Accept-Encoding header should be set with CURLOPT_ENCODING instead (if they're set with CURLOPT_ENCODING then curl will automatically decompress the response if the server choose to compress it, but if you set it via CURLOPT_HTTPHEADER then you must manually detect and decompress the content yourself, which is a pain in the ass and completely unnecessary, generally speaking) so adding those we get:

<?php
$ch=curl_init("http://127.0.0.1:9999");
curl_setopt_array($ch,array(
        CURLOPT_USERAGENT=>'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0',
        CURLOPT_ENCODING=>'gzip, deflate',
        CURLOPT_HTTPHEADER=>array(
                'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Language: en-US,en;q=0.5',
                'Connection: keep-alive',
                'Upgrade-Insecure-Requests: 1',
        ),
));
curl_exec($ch);

now running that code, our netcat server gets:

$ nc -l 9999
GET / HTTP/1.1
Host: 127.0.0.1:9999
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0
Accept-Encoding: gzip, deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Connection: keep-alive
Upgrade-Insecure-Requests: 1

and voila! our php-emulated browser GET request should now be indistinguishable from the real firefox GET request :)

this next part is just nitpicking, but if you look very closely, you'll see that the headers are stacked in the wrong order, firefox put the Accept-Encoding header in line 6, and our emulated GET request puts it in line 3.. to fix this, we can manually put the Accept-Encoding header in the right line,

<?php
$ch=curl_init("http://127.0.0.1:9999");
curl_setopt_array($ch,array(
        CURLOPT_USERAGENT=>'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0',
        CURLOPT_ENCODING=>'gzip, deflate',
        CURLOPT_HTTPHEADER=>array(
                'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Language: en-US,en;q=0.5',
                'Accept-Encoding: gzip, deflate',
                'Connection: keep-alive',
                'Upgrade-Insecure-Requests: 1',
        ),
));
curl_exec($ch);

running that, our netcat server gets:

$ nc -l 9999
GET / HTTP/1.1
Host: 127.0.0.1:9999
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1

problem solved, now the headers is even in the correct order, and the request seems to be COMPLETELY INDISTINGUISHABLE from the real firefox request :) (i don't actually recommend this last step, it's a maintenance burden to keep CURLOPT_ENCODING in sync with the custom Accept-Encoding header, and i've never experienced a situation where the order of the headers are significant)

hanshenrik
  • 19,904
  • 4
  • 43
  • 89
  • 1
    Thank you very much, I Appreciate your answer! `Upgrade-Insecure-Requests: 1` fixed my problems – sergej Jan 25 '20 at 12:40