
I am trying to fetch and decode the web page www.dealstan.com with cURL, using the code below:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url); // Define target site
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Return page in string
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/5.0.342.3 Safari/533.2'); // Was $cr, a typo for $ch
curl_setopt($ch, CURLOPT_ENCODING , "gzip");     
curl_setopt($ch, CURLOPT_TIMEOUT,5); 
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Follow redirects

$return = curl_exec($ch); 
$info = curl_getinfo($ch); 
curl_close($ch); 

$html = str_get_html($return);
echo $html;

But it prints junk characters:

"��}{w�6����9�X�n���.........." for about 100 lines.

I inspected the response with hurl.it and found one interesting point: it looks like the HTML is encoded twice (just a guess, based on the response headers).

Find the response below:

GET http://www.dealstan.com/

200 OK, 18.87 kB, 490 ms

HEADERS

Cache-Control: max-age=0, no-cache
Cf-Ray: 18be7f54f8d80f1b-IAD
Connection: keep-alive
Content-Encoding: gzip, gzip    <== suspecting this, anyone know about it?
Content-Type: text/html; charset=UTF-8
Date: Wed, 19 Nov 2014 18:33:39 GMT
Server: cloudflare-nginx
Set-Cookie: __cfduid=d1cff1e3134c5f32d2bddc10207bae0681416422019; expires=Thu, 19-Nov-15 18:33:39 GMT; path=/; domain=.dealstan.com; HttpOnly
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Page-Speed: 1.8.31.2-3973
X-Pingback: http://www.dealstan.com/xmlrpc.php
X-Powered-By: HHVM/3.2.0

BODY

H4sIAAAAAAAAA5V8Q5AoWrBk27Ztu/u2bdu2bdu2bdu2bds2583f/pjFVOQqozZnUxkVJ7PwoyAA/qeAb3y83LbYHs/3Hv79wKm/2N5cZyJVtCWu1xyteyzLNqYuWbdtHeELCyIZRRp/1Fe7es3+wL3Vfb

Does anyone know how to decode a response with the header "Content-Encoding: gzip, gzip"?

The site loads properly in Firefox, Chrome, etc., but I am not able to decode it using cURL.

Please help me solve this issue.

stackguy
  • Searching Google, I found a bug reported against Mozilla for a similar issue, https://bugzilla.mozilla.org/show_bug.cgi?id=205156, but I could not find any patch for it. Since the site loads properly in Firefox, they must have solved this issue. – stackguy Nov 19 '14 at 19:11
  • Odd. The junk is exactly what's coming back—it shows that way in Safari, too. So it's basically sending back the page gzipped, even though it claims that the Content-Type is text/html. (Is it meant to look like that? Looks to me like their website is just broken. It shows, as I'd expect, the textual representation of the GZIP data if I browse there in Safari...) NB: It seems to be gzipping it in transit, and *also* sending a gzipped version of the page, so I needed to gunzip it *twice* to see the actual HTML. – Matt Gibson Nov 19 '14 at 21:36
  • Just checked a couple of other browsers—Firefox and Chrome successfully show me the webpage; Opera and Safari show me raw gzip data. So, I'd say that the website is misconfigured and is gzipping the page twice, but that some web browsers are detecting this brokenness and decoding it twice for you. I'm not sure I'd rely on it always being broken like that, as sooner or later they're going to realise that their website is broken in some major browsers, and fix the configuration... – Matt Gibson Nov 19 '14 at 21:45
  • As you predicted, they have fixed the issue, and I am now able to parse the page without any problems. Still, if we can find out how Firefox handles this properly, it will help with similar issues in future. – stackguy Nov 20 '14 at 07:25
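
On the closing point about how a client could handle this properly: HTTP lists content codings in the order they were applied, so a header value of `gzip, gzip` means the body has to be gunzipped twice. Below is a minimal sketch of that idea in PHP. `decode_content_encodings` is a hypothetical helper written for illustration (it is not part of cURL, and it is not a description of what Firefox actually does), and it assumes you already have the raw, still-encoded body and the header value in hand.

```php
<?php
// Hypothetical helper (not from the thread): undo every coding named in a
// Content-Encoding header value, starting with the one applied last.
function decode_content_encodings(string $body, string $headerValue): string
{
    // "gzip, gzip" -> ['gzip', 'gzip']; codings are listed in the order applied.
    $codings = array_filter(
        array_map('trim', explode(',', strtolower($headerValue))),
        'strlen'
    );

    foreach (array_reverse($codings) as $coding) {
        switch ($coding) {
            case 'gzip':
            case 'x-gzip':
                $decoded = @gzdecode($body);
                break;
            case 'deflate':
                // HTTP "deflate" is zlib-wrapped data, which gzuncompress reads.
                $decoded = @gzuncompress($body);
                break;
            case 'identity':
                $decoded = $body;
                break;
            default:
                throw new RuntimeException("Unsupported content coding: $coding");
        }
        if ($decoded === false) {
            throw new RuntimeException("Could not decode a '$coding' layer");
        }
        $body = $decoded;
    }
    return $body;
}

// Simulate a doubly gzipped body like the one described in the question.
$html = '<html><body>Hello</body></html>';
$wire = gzencode(gzencode($html));
echo decode_content_encodings($wire, 'gzip, gzip'), "\n";
```

Throwing on a failed layer is a deliberate choice in this sketch: if the header claims two gzip layers but the server has since been fixed to send one, you find out immediately instead of silently parsing garbage.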

1 Answer


You can decode it by trimming off the 10-byte gzip header and passing the rest to gzinflate.

$url = "http://www.dealstan.com";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url); // Define target site
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Return page in string
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/5.0.342.3 Safari/533.2'); // Was $cr, a typo for $ch
curl_setopt($ch, CURLOPT_ENCODING, "gzip");     
curl_setopt($ch, CURLOPT_TIMEOUT, 5); 
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Follow redirects

$return = curl_exec($ch); 
$info = curl_getinfo($ch); 
curl_close($ch); 

$return = gzinflate(substr($return, 10));
print_r($return);
Nalin Singapuri
  • Yup, I'd say this is the way to go. This method is actually unzipping the content twice, as Curl will be unzipping it once, and then you're unzipping it again manually. But you may want to check the response *before* you manually unzip it (the first two bytes in the response will be 1f 8b if it's still gzipped), as at some point this website will surely get some complaints from Safari, Opera, etc., users and fix the configuration problem that's leading to the doubly-encoded content... – Matt Gibson Nov 19 '14 at 22:05
  • I modified the answer to the actual snippet I tested (I don't have str_get_html). Is the print_r($return) there correct? See also http://stackoverflow.com/a/4841712/3922511, which contains a more versatile function. – Nalin Singapuri Nov 19 '14 at 22:35
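
The check suggested in the comments (only gunzip manually while the data still starts with the gzip magic bytes 1f 8b) could be sketched roughly as follows. `gunzip_while_gzipped` and its `$maxLayers` cap are hypothetical names introduced here for illustration, and the sketch uses `gzdecode` (which reads the whole gzip container) rather than the `gzinflate(substr(...))` approach from the answer.

```php
<?php
// Hypothetical helper: keep gunzipping only while the data still begins
// with the gzip magic bytes 0x1f 0x8b, up to an assumed cap of $maxLayers.
function gunzip_while_gzipped(string $data, int $maxLayers = 5): string
{
    for ($i = 0; $i < $maxLayers && strncmp($data, "\x1f\x8b", 2) === 0; $i++) {
        $decoded = @gzdecode($data);
        if ($decoded === false) {
            break; // Started like gzip but did not decode; keep what we have.
        }
        $data = $decoded;
    }
    return $data;
}

$page = '<html>ok</html>';
echo gunzip_while_gzipped($page), "\n";                     // input is already plain
echo gunzip_while_gzipped(gzencode(gzencode($page))), "\n"; // input is doubly gzipped
```

In the answer's snippet this would be applied to `$return` after `curl_exec`, so the code keeps working whether the site sends one gzip layer or two.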