Recently I tried to scrape a website using cURL in PHP and ran into a problem: the response comes back as a weird combination of characters and symbols. I'm really confused about it. I have set the encoding both in the request headers and in the cURL options (`CURLOPT_ENCODING`). Here is the code I use to scrape:

$ch = curl_init();
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file_path);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file_path);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
//curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate,br');
curl_exec($ch);
curl_close($ch);

And these are the headers I send:

$header = [
    ':authority: www.airpaz.com',
    ':method: GET',
    ':path: $path',
    ':scheme: https',
    'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding: gzip, deflate, br',
    'accept-language: en-US,en;q=0.9',
    'cache-control: max-age=0',
    'referer: $referer',
    'upgrade-insecure-requests: 1',
    'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
];

When I run it, the output looks exactly like the image below:

[image: screenshot of the garbled output]

Can anyone tell me what the problem is? Thanks for your time; it would help me a lot.

  • `accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8`. The last bit, `*/*`, says: if you cannot serve me `text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng`, I'll accept anything. And you got gzip because you accept it. Either don't accept it, or ungzip the received content. If the server does not negotiate content correctly, you WILL HAVE TO ungzip the content. – Mike Doe Oct 18 '18 at 08:16
  • Then which part of the code should I change? – BOBBY IRAWAN Oct 18 '18 at 08:17
  • Isn't this obvious already? If you don't wish to accept *everything* drop the `*/*;q=0.8` from the `Accept` header. – Mike Doe Oct 18 '18 at 08:19
  • I got that from the website's request headers, so I think it should not be deleted. – BOBBY IRAWAN Oct 18 '18 at 08:21
  • You clearly miss the point of the Accept header. – Mike Doe Oct 18 '18 at 08:24
  • I still don't understand. Can you give me a reference so I can learn about it? – BOBBY IRAWAN Oct 18 '18 at 08:29
  • https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept , https://stackoverflow.com/questions/5331452/http-accept-header-meaning – Mike Doe Oct 18 '18 at 08:31
  • I have deleted it, but it still returns output like in the picture. – BOBBY IRAWAN Oct 18 '18 at 08:37
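For what it's worth, here is a minimal sketch of the "either don't accept it, or ungzip the received content" advice from the comments. It assumes, without confirmation from the thread, that the garbled output is a compressed response body that was never decoded. One plausible cause is a libcurl build that cannot decode `br`, even though the request advertises it. The function names `fetch_decoded` and `maybe_gunzip` are hypothetical, and `gzdecode` requires PHP's zlib extension. Setting `CURLOPT_ENCODING` to an empty string makes cURL advertise only the encodings it can decode itself, so the manual `accept-encoding` header is left out. The `:authority`, `:method`, `:path` and `:scheme` lines are HTTP/2 pseudo-headers that the client normally generates on its own, so the sketch does not send them either.

```php
<?php
// Hypothetical helper: fetch a URL and return the response body as a string.
function fetch_decoded(string $url, array $headers = []): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_ENCODING       => '',   // empty string: advertise every encoding this cURL build supports
        CURLOPT_HTTPHEADER     => $headers, // no ':' pseudo-headers, no manual accept-encoding
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException("cURL request failed: $error");
    }
    curl_close($ch);

    // Fallback in case the server sent gzip that cURL did not decode.
    return maybe_gunzip($body);
}

// Hypothetical helper: gunzip the body only if it looks like a gzip stream.
function maybe_gunzip(string $body): string
{
    // gzip streams start with the magic bytes 0x1f 0x8b
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    return $body;
}
```

A call could look like `echo fetch_decoded($url, ['accept-language: en-US,en;q=0.9']);`, where `$url` is the page from the question. If the output is still unreadable, the body may use an encoding that neither cURL nor `gzdecode` handles, and inspecting the response's `Content-Encoding` header would be the next step.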

0 Answers