
I'm trying to simulate a real browser request using cURL with proxy rotation. I searched for answers, but none of them worked.

Here is the code:

$url= 'https://www.stubhub.com/';
$proxy = '1.10.185.133:30207';
$userAgent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36';

$curl = curl_init();
curl_setopt( $curl, CURLOPT_URL, trim($url) );
curl_setopt($curl, CURLOPT_REFERER, trim($url));
curl_setopt( $curl, CURLOPT_RETURNTRANSFER, TRUE );
curl_setopt( $curl, CURLOPT_FOLLOWLOCATION, TRUE );
curl_setopt( $curl, CURLOPT_CONNECTTIMEOUT, 0 );
curl_setopt( $curl, CURLOPT_TIMEOUT, 0 );
curl_setopt( $curl, CURLOPT_AUTOREFERER, TRUE );
// Since a CA bundle is provided below, verification can stay enabled
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 2);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, TRUE);
$cacert = 'C:/xampp/htdocs/cacert.pem';
curl_setopt( $curl, CURLOPT_CAINFO, $cacert );
curl_setopt($curl, CURLOPT_COOKIEFILE,__DIR__."/cookies.txt");
curl_setopt ($curl, CURLOPT_COOKIEJAR, dirname(__FILE__) . '/cookies.txt');
curl_setopt($curl, CURLOPT_MAXREDIRS, 5);
curl_setopt( $curl, CURLOPT_USERAGENT, $userAgent );

//Headers
$header = array();
$header[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
$header[] = "Accept-Language: cs,en-US;q=0.7,en;q=0.3";
// "utf-8" is a charset, not a content encoding; let cURL send a valid
// Accept-Encoding header and decode gzip/deflate responses automatically:
curl_setopt($curl, CURLOPT_ENCODING, "");
$header[] = "Connection: keep-alive";
$header[] = "Host: www.stubhub.com"; // must match the requested URL (was www.gumtree.com, a leftover from another script)
$header[] = "Origin: https://www.stubhub.com";
$header[] = "Referer: https://www.stubhub.com";

// CURLOPT_HTTPHEADER sends the custom headers; CURLOPT_HEADER merely
// includes the response headers in the returned output
curl_setopt( $curl, CURLOPT_HTTPHEADER, $header );
curl_setopt($curl, CURLOPT_PROXYTYPE, CURLPROXY_HTTP);
curl_setopt($curl, CURLOPT_HTTPPROXYTUNNEL, TRUE);
curl_setopt($curl, CURLOPT_PROXY, $proxy);
curl_setopt($curl, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
$data = curl_exec( $curl );
$info = curl_getinfo( $curl );
$error = curl_error( $curl );
$all = ['data' => $data, 'info' => $info, 'error' => $error];

echo '<pre>';
print_r($all);
echo '</pre>';
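The title mentions proxy rotation, but the script hardcodes a single proxy. A minimal round-robin sketch — the pool addresses below are placeholders, not working proxies:

```php
<?php
// Hypothetical proxy pool; substitute your own addresses.
$proxyPool = [
    '1.10.185.133:30207',
    '2.20.30.40:8080',
    '5.6.7.8:3128',
];

// Return the next proxy in round-robin order; $state persists across calls.
function nextProxy(array $pool, int &$state): string {
    $proxy = $pool[$state % count($pool)];
    $state++;
    return $proxy;
}

// Each new request takes the next proxy from the pool:
$state = 0;
$curl  = curl_init('https://www.stubhub.com/');
curl_setopt($curl, CURLOPT_PROXY, nextProxy($proxyPool, $state));
curl_setopt($curl, CURLOPT_PROXYTYPE, CURLPROXY_HTTP);
// ... set the remaining options as in the script above, then curl_exec($curl).
```

Rotating the proxy per request (rather than per script run) is what makes the rotation useful against per-IP rate limits.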

Here is what I get when I run the script:

Array
(
    [data] => HTTP/1.1 200 OK

HTTP/1.0 405 Method Not Allowed
Server: nginx
Content-Type: text/html; charset=UTF-8
Accept-Ranges: bytes
Expires: Thu, 01 Jan 1970 00:00:01 GMT
Cache-Control: private, no-cache, no-store, must-revalidate
Surrogate-Control: no-store, bypass-cache
Content-Length: 9411
X-EdgeConnect-MidMile-RTT: 203
X-EdgeConnect-Origin-MEX-Latency: 24
Date: Sat, 03 Nov 2018 17:15:56 GMT
Connection: close
Strict-Transport-Security: max-age=31536000; includeSubDomains

[info] => Array
        (
            [url] => https://www.stubhub.com/
            [content_type] => text/html; charset=UTF-8
            [http_code] => 405
            [header_size] => 487
            [request_size] => 608
            [filetime] => -1
            [ssl_verify_result] => 0
            [redirect_count] => 0
            [total_time] => 38.484
            [namelookup_time] => 0
            [connect_time] => 2.219
            [pretransfer_time] => 17.062
            [size_upload] => 0
            [size_download] => 9411
            [speed_download] => 244
            [speed_upload] => 0
            [download_content_length] => 9411
            [upload_content_length] => -1
            [starttransfer_time] => 23.859
            [redirect_time] => 0
            [redirect_url] => 
            [primary_ip] => 1.10.186.132
            [certinfo] => Array
                (
                )

            [primary_port] => 42150
            [local_ip] => 192.168.1.25
            [local_port] => 59320
        )

    [error] => 
)

The response also contains a reCAPTCHA page, which says:

Due to high volume of activity from your computer, our anti-robot software has blocked your access to stubhub.com. Please solve the puzzle below and you will immediately regain access.

When I visit the website in any browser, the page is displayed, but with the above script it is not.

So what am I missing to make the cURL request look like a real browser request and not be detected as a bot?

If there is an API or library that can do this, please mention it.

Would Guzzle or similar fix this issue?
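For reference, Guzzle drives the same cURL under the hood, so by itself it will not defeat bot detection — but it does make cookie persistence and header handling less error-prone. A hedged sketch, assuming `guzzlehttp/guzzle` is installed via Composer (the class check skips the request when it is not):

```php
<?php
// Guzzle-style request options mirroring the hand-rolled script above.
$options = [
    'proxy'   => 'http://1.10.185.133:30207',
    'headers' => [
        'User-Agent'      => 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
        'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language' => 'cs,en-US;q=0.7,en;q=0.3',
    ],
    'allow_redirects' => ['max' => 5],
    'timeout'         => 30,
];

if (class_exists('GuzzleHttp\Client')) {
    // The CookieJar keeps cookies across requests automatically -- the part
    // the hand-rolled script most easily gets wrong.
    $jar    = new GuzzleHttp\Cookie\CookieJar();
    $client = new GuzzleHttp\Client(['cookies' => $jar]);
    $html   = (string) $client->get('https://www.stubhub.com/', $options)->getBody();
}
```

So Guzzle is a cleaner way to express the same request, not a way around the anti-bot check itself.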

  • Well first off you have an extra `'` right here `$proxy = ''1.10.185.133:30207'` – ArtisticPhoenix Nov 03 '18 at 19:43
  • @ArtisticPhoenix, You are right, It's a typo error while copying/pasting the code, I updated the question –  Nov 03 '18 at 19:46
  • Possible duplicate of [Setting headers using CURL](https://stackoverflow.com/questions/53108139/setting-headers-using-curl) –  Nov 04 '18 at 20:24

1 Answer


"So what am I missing to make the curl request like a real browser request"

My guess is they are using a simple cookie check. There are more sophisticated methods that can recognize automation tools such as cURL with a high degree of reliability, especially when coupled with lists of proxy IP addresses or IPs of known offenders.

Your first step is to intercept the outgoing browser request using pcap or something similar, then try to replicate it with cURL.
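Short of a full packet capture, cURL itself can show what it actually sent: enable `CURLINFO_HEADER_OUT` and read the outgoing headers back after the transfer, then diff them against the browser's request in the devtools Network tab. A minimal sketch:

```php
<?php
$curl = curl_init('https://www.stubhub.com/');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($curl, CURLOPT_TIMEOUT, 10);

// Record the request line and headers that cURL actually sends.
$ok = curl_setopt($curl, CURLINFO_HEADER_OUT, true);

curl_exec($curl);

// The raw outgoing request, for comparison with the browser's version:
echo curl_getinfo($curl, CURLINFO_HEADER_OUT);
```

Any header the browser sends and this output lacks (or vice versa) is a candidate fingerprint.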

One other simple thing to check is whether your cookie jar has been seeded with some telltale value. I routinely do that myself, since most scripts on the Internet are copy-pastes that don't pay much attention to these details.

The thing that would for sure make you bounce from any of my systems is that you're sending a Referer header, but you don't seem to have actually connected to that first page. You're practically saying "Well met again" to a server that is seeing you for the first time. Alternatively, you might have saved a cookie from a first encounter, and that cookie has since been invalidated (actually, marked "evil") by some other action. At least in the beginning, always replicate the visiting sequence from a clean slate.
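The "clean slate" sequence can be sketched like this: delete the old jar, visit the landing page first so the server issues fresh cookies, then request the inner page with the same handle (the inner-page path below is hypothetical, for illustration only):

```php
<?php
$jar = __DIR__ . '/cookies.txt';
if (file_exists($jar)) {
    unlink($jar); // clean slate: no stale or already-flagged cookies
}

$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_COOKIEFILE, $jar);
curl_setopt($curl, CURLOPT_COOKIEJAR, $jar);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($curl, CURLOPT_TIMEOUT, 15);

// 1. Visit the landing page first, exactly as a browser would.
curl_setopt($curl, CURLOPT_URL, 'https://www.stubhub.com/');
curl_exec($curl);

// 2. Only now request the inner page, carrying the freshly issued cookies
//    and a Referer that matches the page we really came from.
curl_setopt($curl, CURLOPT_URL, 'https://www.stubhub.com/some-event-page'); // hypothetical path
curl_setopt($curl, CURLOPT_REFERER, 'https://www.stubhub.com/');
$html = curl_exec($curl);
```

Reusing one handle for both steps keeps the connection and cookie state consistent, which is closer to what a browser does.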

You might try and adapt this answer, also cURL-based. Always verify actual traffic using a MitM SSL-decoding proxy.

Now, the real answer - what do you need that information for? Can you get it somewhere else? Can you ask for it explicitly, maybe reach an agreement with the source site?

LSerni
  • You really provided a lot of information here and in the other question. I viewed the object-oriented code, but it seems it doesn't contain proxy functions, and I'm not sure if it would work with SSL websites –  Nov 04 '18 at 17:57
  • Regarding the information from this website: I don't mean that specific website. I came across some sites using similar services, so I took it as an example –  Nov 04 '18 at 18:01
  • @Bon You can modify the code to also work with SSL websites (I did). I've looked at my own Browser module to see whether I can decouple it sufficiently to make it useful as a Stack Overflow post. Unfortunately, it seems it has grown quite a bit since its inception, and it's not really a trivial task. Since it should be, I'll probably work on it -- but not quite now. If you could build on the basic module and maybe ask further questions, that would be best. My email is the obvious one - just add gmail dot com to my nickname. These days I can't promise I'll answer quickly, but I'll try. – LSerni Nov 04 '18 at 20:04