Why is file_get_contents returning garbled data?

Question

I am trying to grab the HTML from the page below using some simple PHP.

URL: https://kat.cr/usearch/architecture%20category%3Abooks/

My code is:

$html = file_get_contents('https://kat.cr/usearch/architecture%20category%3Abooks/');
echo $html;

file_get_contents works, but returns scrambled data.

I have tried using cURL as well as various functions such as htmlentities(), mb_convert_encoding() and utf8_encode(), but I just get different variations of the scrambled text.

The source of the page says it is charset=utf-8, but I am not sure what the problem is.

Calling file_get_contents() on the base URL kat.cr returns the same mess.

What am I missing here?

http://stackoverflow.com/questions/11363022/get-url-content-php Check this out. — Jacob Mathison, Aug 10 '15 at 21:14
See: [How can I read GZIP-ed response](http://stackoverflow.com/q/8581924/55075) — kenorb, Aug 10 '15 at 21:17

score 3 · Answer 1 · answered Aug 10 '15 at 21:22

The response is gzip-compressed. A browser decompresses it automatically when it fetches the page, so here you need to decompress it yourself. To decompress and output it in one step, you can use readgzfile():

readgzfile('https://kat.cr/usearch/architecture%20category%3Abooks/');
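If the data should end up in a variable rather than being printed, one option is to check for the gzip magic number (the bytes 0x1f 0x8b) and decode only when it is present. The following is a minimal sketch, assuming PHP 7 or later with the zlib extension; the helper name maybe_gunzip() is hypothetical and not part of PHP.

```php
<?php
// Hypothetical helper: returns the decompressed data if $data looks like
// gzip, otherwise returns $data unchanged.
function maybe_gunzip(string $data): string
{
    // Every gzip stream starts with the two magic bytes 0x1f 0x8b.
    if (strncmp($data, "\x1f\x8b", 2) === 0) {
        // gzdecode() returns false (and warns) on corrupt input,
        // so suppress the warning and fall back to the raw data.
        $decoded = @gzdecode($data);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    return $data;
}
```

It could then be used as $html = maybe_gunzip(file_get_contents($url)); where $url is the page to fetch.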

AbraCadaver

Thanks! Simple and effective. – ian Aug 11 '15 at 02:42

kenorb · Accepted Answer · 2015-08-11T08:23:42.680

The site's response is compressed, so you have to decompress it to get back the original content.

The quickest way is to use gzinflate() as below:

$html = gzinflate(substr(file_get_contents("https://kat.cr/usearch/architecture%20category%3Abooks/"), 10, -8));
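The offsets 10 and -8 strip a minimal gzip header (10 bytes) and the trailer (a 4-byte CRC-32 plus a 4-byte length), leaving the raw deflate stream that gzinflate() expects. That assumes the server sends no optional header fields such as a file name; if it does, the header is longer and the slice is wrong. The sketch below, using made-up sample data, compares that approach with gzdecode() (available since PHP 5.4), which parses the header itself.

```php
<?php
// Sample data standing in for a downloaded page (made up for illustration).
$original = '<html><body>architecture books</body></html>';

// gzencode() produces a gzip stream with a minimal 10-byte header.
$gzipped = gzencode($original);

// The first two bytes are the gzip magic number.
$magic = bin2hex(substr($gzipped, 0, 2));

// Approach from the answer: cut off header and trailer by hand.
$viaInflate = gzinflate(substr($gzipped, 10, -8));

// Alternative: let gzdecode() parse the header and verify the trailer.
$viaDecode = gzdecode($gzipped);

var_dump($magic, $viaInflate === $original, $viaDecode === $original);
```

Both approaches agree on this input; gzdecode() is likely the more robust choice for responses whose header layout is not known in advance.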

Or, for a more advanced solution, consider the following function (found at this blog):

function get_url($url)
{
    // A User-Agent header is necessary; otherwise some websites, like google.com, won't send compressed content
    $opts = array(
        'http'=>array(
            'method'=>"GET",
            // Each header line must end with \r\n
            'header'=>"Accept-Language: en-US,en;q=0.8\r\n" .
                        "Accept-Encoding: gzip,deflate,sdch\r\n" .
                        "Accept-Charset:UTF-8,*;q=0.5\r\n" .
                        "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:19.0) Gecko/20100101 Firefox/19.0 FirePHP/0.4\r\n"
        )
    );

    $context = stream_context_create($opts);
    $content = file_get_contents($url, false, $context);

    // If the HTTP response headers say the content is gzipped, then uncompress it
    foreach($http_response_header as $c => $h)
    {
        if(stristr($h, 'content-encoding') and stristr($h, 'gzip'))
        {
            //Now lets uncompress the compressed data
            $content = gzinflate( substr($content,10,-8) );
        }
    }

    return $content;
}

echo get_url('http://www.google.com/');
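Note that the function above advertises gzip, deflate and sdch in Accept-Encoding but only decodes gzip. As a rough sketch of how the decoding step might be separated from the download and extended to deflate, consider the helper below. It assumes PHP 7 or later with the zlib extension; the name decode_body() is hypothetical, and $headers is expected to have the shape of $http_response_header (an array of raw header lines).

```php
<?php
// Hypothetical helper: decode $body according to the Content-Encoding
// header found in $headers. Returns $body unchanged if the encoding is
// absent, unknown, or the data cannot be decoded.
function decode_body(string $body, array $headers): string
{
    $encoding = '';
    foreach ($headers as $header) {
        // Header names are case-insensitive. After redirects the array
        // holds several responses, so the last match wins.
        if (stripos($header, 'Content-Encoding:') === 0) {
            $value = substr($header, strlen('Content-Encoding:'));
            $encoding = strtolower(trim($value));
        }
    }

    switch ($encoding) {
        case 'gzip':
        case 'x-gzip':
            $decoded = @gzdecode($body);
            break;
        case 'deflate':
            // "deflate" normally means zlib-wrapped data, but some
            // servers send a raw deflate stream, so try both.
            $decoded = @gzuncompress($body);
            if ($decoded === false) {
                $decoded = @gzinflate($body);
            }
            break;
        default:
            return $body;
    }

    return $decoded === false ? $body : $decoded;
}
```

Inside get_url() this could replace the foreach loop with $content = decode_body($content, $http_response_header);. With cURL, setting CURLOPT_ENCODING to an empty string asks libcurl to request and decode every encoding it supports, which avoids manual decoding altogether.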
