simple_html_dom doesn't work for some sites

Question

simple_html_dom doesn't work for some websites and returns unreadable output. For example:

$html = file_get_html('http://www.tsetmc.com/loader.aspx?ParTree=151311&i=49776615757150035');
echo $html;

And the result is something like this:

�D�}R��][��ƕ~OU�̇p�����" gK�e[�8+[���)� B3t8䘄F�8�Z[7�ʿ�/rT�'����K~i��/�s��0��h��>���ڷ�7�����8��������(l��Eq������;��V������u�tƝ[ݨ���{qԋ[�kW[Q� j��ĝ���n\�{�ʅ��p�=�����#���??�����I�����s�޾�ۏ;������?<���$xݓV��vo��AxQ|-��6'7oƧ��R|�s�ۀ��ޝn��ӟ�����ǭ^t����߼��|O4�76/�?��Qo���ս��5�at¶�p���� ����-n5�9o6u����Ŀv�Q�v

What can I do to fix this problem?

Did you check this: http://stackoverflow.com/questions/12351776/character-encoding-issue-with-php-simple-html-dom-parser — Enissay, Nov 08 '13 at 23:23
I just edited my answer, I think I found the real problem here... — Adam D. Ruppe, Nov 09 '13 at 17:04

Adam D. Ruppe · Accepted Answer · 2013-11-09T17:04:37.847

2

The root problem here (at least on my computer; it may be different with your version) is that the site returns gzipped data, and PHP's download functions don't decompress it before it is passed to the DOM parser. If you are using PHP 5.4 or later, you can fetch the page with file_get_contents and decompress it yourself with gzdecode.
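As a minimal sketch of that PHP 5.4+ route, assuming the zlib extension is available: the helper name fetch_gunzipped is made up for illustration and is not part of simple_html_dom or PHP.

```php
<?php
// Hypothetical helper (name chosen for this example only).
// Requires PHP >= 5.4 built with zlib, because it relies on gzdecode().
function fetch_gunzipped($url)
{
    // Download the raw bytes; for this site they arrive gzip-compressed.
    $raw = file_get_contents($url);
    if ($raw === false) {
        return false; // the download itself failed
    }
    // gzdecode() understands the full gzip container (header and trailer).
    // It returns false, with a warning, if $raw is not valid gzip data.
    return gzdecode($raw);
}
```

The result can then be handed to the parser with str_get_html(fetch_gunzipped($url)), after checking that it is not false.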

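On older PHP versions, this code will work:

<?php
    include("simple_html_dom.php");

    // download the page (the body arrives gzip-compressed)
    $data = file_get_contents("http://www.tsetmc.com/loader.aspx?ParTree=151311&i=49776615757150035");
    if ($data === false) {
        die("download failed");
    }
    // decompress it (a bit hacky: strip the 10-byte gzip header and the
    // 8-byte trailer, then inflate the raw deflate stream in between;
    // this assumes the header carries no optional fields)
    $data = gzinflate(substr($data, 10, -8));
    // parse and use
    $html = str_get_html($data);
    echo $html->root->innertext();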

Note that this hack will not work on most sites, because most sites only send gzipped data when the client asks for it. The underlying cause, as far as I can tell, is that the client (file_get_contents here, and curl by default) doesn't announce that it accepts gzip data, but the web server on that domain ignores the missing Accept-Encoding header and gzips the response anyway. The client then doesn't check the Content-Encoding header on the response, assumes the body isn't gzipped, and passes it through without raising an error or decompressing it. Bugs in both the server and the client here!

For a more robust solution, maybe you can use curl to get the headers and inspect them yourself to determine if you need to decompress it. Or you can just use this hack for this site and the normal method for others to keep things simple.
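One way the "inspect it yourself" idea could look is sketched below. It assumes PHP 5.4+ with zlib, and the function name decode_body_if_gzipped is hypothetical. It treats the body as gzipped if a Content-Encoding: gzip header is present or if the body starts with the gzip magic bytes, and otherwise returns it untouched.

```php
<?php
// Hypothetical helper (not part of simple_html_dom or PHP itself).
// $headers is a list of raw response header lines, e.g. "Content-Encoding: gzip".
function decode_body_if_gzipped($body, array $headers = array())
{
    $gzipped = false;

    // 1. Look for an explicit Content-Encoding: gzip response header.
    foreach ($headers as $line) {
        if (preg_match('/^Content-Encoding:\s*(?:x-)?gzip\s*$/i', $line)) {
            $gzipped = true;
        }
    }

    // 2. Otherwise fall back to the gzip magic bytes 0x1f 0x8b.
    if (!$gzipped && strncmp($body, "\x1f\x8b", 2) === 0) {
        $gzipped = true;
    }

    if (!$gzipped) {
        return $body; // plain response: use the normal method
    }

    // gzdecode() needs PHP >= 5.4; @ hides its warning on invalid data.
    $decoded = @gzdecode($body);

    // If decoding fails, hand back the original bytes rather than false.
    return $decoded === false ? $body : $decoded;
}
```

After $data = file_get_contents($url) over HTTP, PHP leaves the response header lines in $http_response_header, so the call would be decode_body_if_gzipped($data, $http_response_header). With curl you would collect the header lines yourself and pass them in the same way.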

It might still also help to set the character encoding on your output. Add this before you echo anything to ensure the data you read isn't recorrupted in the user's browser by being read as the wrong charset:

header('Content-Type: text/html; charset=utf-8');

edited Nov 09 '13 at 17:04

answered Nov 08 '13 at 23:08


using the same url as your question or a different one? – Adam D. Ruppe Nov 09 '13 at 00:09
I used the same URL as in the question. – MOB Nov 09 '13 at 09:31
I found the root problem: it is actually that file_get_contents doesn't gunzip the data. So it is trying to parse compressed data as HTML, and that is coming out as garbage. PHP 5.4 has a gzip decode function built in, but the previous versions don't. – Adam D. Ruppe Nov 09 '13 at 16:57
Thanks a lot. There is another problem: this code doesn't return the source of the requested URL; it returns what looks like the main page. We request the source of http://www.tsetmc.com/loader.aspx?ParTree=151311&i=49776615757150035 and receive the code of something like http://www.tsetmc.com/Loader.aspx?ParTree=15131F. What's the issue? – MOB Nov 09 '13 at 18:23
For example, the class "box1 red tbl zi1_4 h110" is not in the source fetched with PHP, but it exists in the original page. – MOB Nov 09 '13 at 18:26
I didn't even notice that at first because I keep JavaScript disabled in my browser most of the time, but that's the cause: the other code is added by JavaScript, and the PHP DOM parser doesn't run it like the browser does. – Adam D. Ruppe Nov 09 '13 at 18:39
I don't know, maybe try searching the web for a screen scraper with javascript support. – Adam D. Ruppe Nov 09 '13 at 21:01
