Getting garbage output when scraping a webpage in PHP

Question

I am trying to get the contents of a page from Amazon using file_get_html(), but the output comes out as weird characters when I echo it. Can anyone please explain how I can resolve this issue?

I also found the following two related questions on Stack Overflow but they did not solve my issue. :)

Here is my code:

$options = array(
    'http' => array(
        'header' =>
            "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n" .
            "Accept-language: en-US,en;q=0.5\r\n" .
            "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6\r\n"
    )
);
$context = stream_context_create($options);

$amazon_url = 'https://www.amazon.com/my-url';
$amazon_html = file_get_contents($amazon_url, false, $context);

Here is the output I get:

��T]o�6}��`���0��݊-��"[�bh�tN�b0��.%%�$P��@�(Ų�� ������F#����A�

About 115k characters like this show up in the browser window.

These are my new headers:

$options = array(
    'http' => array(
        'header' =>
            "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n" .
            "Accept-language: en-US,en;q=0.5\r\n"
    )
);

Will using cURL resolve this issue?

Update:

I tried cURL. Still getting the garbage output. Here are my response headers:

HTTP/1.1 200 OK
Date: Sun, 18 Nov 2018 20:29:28 GMT
Server: Apache/2.4.33 (Win32) OpenSSL/1.1.0h PHP/7.2.5
X-Powered-By: PHP/7.2.5
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

Can anyone explain the negative votes?

I did my own research.
Found some related questions on Stack Overflow which did not solve my problem.
Provided all the information that I thought would be helpful.

What else should I include in the question?

Here is my complete cURL code at present. This is the URL I am scraping.

$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, $amazon_url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($handle);
curl_close($handle);

echo $data;

The output is just the same bunch of characters I mentioned above. Here are my request headers:

Host: localhost
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Cookie: AMCV_17EB401053DAF4840A490D4C%40AdobeOrg=-227196251%7CMCIDTS%7C17650%7CMCMID%7C67056225185486460220940124683302119708%7CMCAID%7CNONE%7CMCOPTOUT-1524907071s%7CNONE; mjx.menu=renderer%3ACommonHTML; _ga=GA1.1.2019605490.1529649408; csm-hit=adb:adblk_no&tb:s-3521C4J8F2EP1V0MMQEP|1542578145652&t:1542578146256
Upgrade-Insecure-Requests: 1
Pragma: no-cache
Cache-Control: no-cache

These are from the Network Tab. The response headers are the same as I mentioned above.

Here is the output after adding curl_setopt($handle, CURLOPT_HEADER, 1); to my code:

HTTP/1.1 200 OK
Server: Server
Content-Type: text/html; charset=UTF-8
Strict-Transport-Security: max-age=47474747; includeSubDomains; preload
x-amz-id-1: 7A162B8JKV6MGZQ3PCH2
Vary: Accept-Encoding,User-Agent,X-Amzn-CDN-Cache
Content-Encoding: gzip
x-amz-rid: 7A162B8JKV6MGZQ3PCH2
Cache-Control: no-transform
X-Frame-Options: SAMEORIGIN
Date: Sun, 18 Nov 2018 22:42:51 GMT
Transfer-Encoding: chunked
Connection: keep-alive
Connection: Transfer-Encoding
Set-Cookie: x-wl-uid=1a4u8+XgF+IhFF/iavy9mKZCAA0g4HiIYZXR8hKjxGtmOtBW+j67wGABv7ZOTxDRcab+7Qmpjqds=;

We're known to be a bit hasty with the downvotes; in your case I don't believe they were deserved. Your code seems fine at first blush. If you load the same Amazon URL in your browser, is it a plain text / HTML file with proper output? — sheng, Nov 18 '18 at 22:02
It does seem like an encoding issue, although it's pretty strange, since Amazon should return UTF-8 by default, and your script (are these headers what your script returns?) also seems to be returning a UTF-8 string. You're running a recent PHP, so that's not the problem either. Can you show us the whole actual cURL code and the exact URL you're trying to echo for a test? — p0358, Nov 18 '18 at 22:03
This is not limited to any particular URLs. Also, it seems to happen "randomly". Sometimes, it happens once in a while. Other times, it keeps happening again and again. — Real Noob, Nov 18 '18 at 22:05
Well, the code looks fine to me. I have also checked it in a similar environment (Windows) and I can successfully view the Amazon page in my browser using it. You mentioned that it seems to happen randomly, which means it's probably a more complicated issue. Do you have any PHP errors in the logs by chance? — p0358, Nov 18 '18 at 22:25
@p0358 When you said "I can successfully view Amazon page in my browser using it", did you mean that the above code echoed the page properly for you? — Real Noob, Nov 18 '18 at 22:28
Yes, I used the wording to mean that it even did load all assets, so the page displayed correctly in the browser as well, just as if it was the actual original page — p0358, Nov 18 '18 at 22:29
@p0358 I am running the code on localhost and the browser displays no PHP errors, just the garbage output. :) — Real Noob, Nov 18 '18 at 22:31
Browser yes, but depending on your config the errors may be saved to a separate log file. I also have an assumption that the data you retrieve may not be text/html, which would require a little fix in the code to display properly in browser then. (I'll post example in a second) — p0358, Nov 18 '18 at 22:34
For now, add `curl_setopt($handle, CURLOPT_HEADER, 1);` somewhere before curl_exec so that the output includes the response headers from Amazon. Then we will be able to see whether my assumption holds — p0358, Nov 18 '18 at 22:37
@p0358, I am adding the output I received to the question details. :) — Real Noob, Nov 18 '18 at 22:43
@p0358 I apologize for the delay in responding. I restarted my laptop and ran my original code again because doing this gave correct output once earlier. It did not work this time though. :D — Real Noob, Nov 18 '18 at 22:46
I think I have run out of ideas; both MIME type and encoding seem to match. When looking further into it, keep in mind the code is fine (it worked for me), so the issue has to lie somewhere else (maybe config, maybe something with the network, maybe Amazon actually responds with garbage as a rate-limiting measure? no idea). — p0358, Nov 18 '18 at 22:50
I had the same issue, and the single answer to this question, below, solved it for me. It's a compression issue, not an encoding one. A Zip file will look like a lot of garbage if you view it raw without uncompressing it. — Brian Tristam Williams, May 19 '21 at 04:26

score 6 · Accepted Answer · answered Dec 19 '18 at 19:22

Here's the solution:

I ran into the same issue when scraping Amazon. Simply add the following option before sending your cURL request:

curl_setopt($handle, CURLOPT_ENCODING, 'gzip,deflate,sdch');
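
To show where that option fits, here is a minimal sketch of the question's cURL code with decoding turned on. The function name fetch_page is hypothetical, and passing an empty string instead of an explicit list is my own substitution: per the PHP manual, an empty string makes cURL send an Accept-Encoding header listing every encoding it supports, and cURL then decompresses the response itself. (As far as I know, libcurl has no sdch decoder, so it is probably safer not to advertise sdch.)

```php
<?php
// Hypothetical helper, not part of the original answer.
function fetch_page(string $url): string
{
    $handle = curl_init();
    curl_setopt($handle, CURLOPT_URL, $url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    // '' = advertise every encoding this cURL build supports and
    // let cURL decode the response body before returning it.
    curl_setopt($handle, CURLOPT_ENCODING, '');

    $data = curl_exec($handle);
    if ($data === false) {
        throw new RuntimeException('cURL error: ' . curl_error($handle));
    }

    // The handle is released when it goes out of scope.
    return $data;
}
```

With the question's variable, usage would be `echo fetch_page($amazon_url);`.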

jeff

Lifesaver, thanks! I knew it had something to do with compression. After running fine for years, my IMDb scrapes started failing randomly after 2021-05-18T18:00:00Z: sometimes a page works; sometimes it returns compressed garbage. It hasn't happened since I added your line. – Brian Tristam Williams May 19 '21 at 04:24
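
If you stay with file_get_contents() as in the question's first attempt, nothing decompresses the body for you. Here is a rough sketch of one workaround, assuming the zlib extension is loaded: check for the gzip magic bytes (0x1f 0x8b) and inflate only when they are present. decode_if_gzipped is a made-up name, not something from the posts above, and the demo compresses a local string rather than contacting any server.

```php
<?php
// Hypothetical helper: returns the body unchanged unless it starts with
// the gzip magic bytes 0x1f 0x8b, in which case it is inflated.
function decode_if_gzipped(string $body): string
{
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body; // not gzip, leave as-is
    }

    // gzdecode() needs the zlib extension; it returns false
    // (with a warning) when the data is corrupt.
    $decoded = gzdecode($body);

    return $decoded === false ? $body : $decoded;
}

// Offline demonstration: compress a page, then recover it.
$html = '<html><body>Hello</body></html>';
$compressed = gzencode($html);

echo bin2hex(substr($compressed, 0, 2)), "\n"; // prints 1f8b
echo decode_if_gzipped($compressed), "\n";     // prints the original HTML
```

In the question's code this would wrap the result of file_get_contents($amazon_url, false, $context) before echoing it.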
