12

I receive an HTML string using curl:

curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html_string = curl_exec($ch);

When I echo it, I see perfectly good HTML, which is what I need for parsing. But when I pass this string to the HTML DOM Parser function str_get_html($html_string), it does not load it (the call returns false).

I tried saving it to a file and opening the file with file_get_html, but the same thing occurs.

What can be the cause of this? As I said, the html looks perfectly fine when I echo it.

Thanks a lot.

The code itself:

$html = file_get_html("http://www.bgu.co.il/tremp.aspx");
$v = $html->find('input[id=__VIEWSTATE]');
$viewState = $v[0]->attr['value'];
$e = $html->find('input[id=__EVENTVALIDATION]');
$event = $e[0]->attr['value'];

$html->clear(); 
unset($html);

$body = " A_STRING_THAT_CONTAINS_SOME_DATA ";

$ch = curl_init("http://www.bgu.co.il/tremp.aspx");
curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$html_string = curl_exec($ch);

$file_handle = fopen("file.txt", "w");
fwrite($file_handle, $html_string);
fclose($file_handle);

curl_close($ch);

$html = str_get_html($html_string);
Dani

3 Answers

43

The page you fetch with curl seems to contain many elements (it is a large file).

I was parsing a string (file) about as large as yours and ran into the same problem.

After reading the source code, I found the cause, and the fix below works for me.

I found that simple_html_dom.php limits the size of the input it will read.

// get html dom from string
  function str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
  {
           $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
           if (empty($str) || strlen($str) > MAX_FILE_SIZE)
           {
                   $dom->clear();
                   return false;
           }
           $dom->load($str, $lowercase, $stripRN);
           return $dom;
  }

You need to raise the default size shown below (it is defined at the top of simple_html_dom.php).
Maybe change it to 100000000? It's up to you.

define('MAX_FILE_SIZE', 6000000); 
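
To check whether the guard in str_get_html() is what rejects your string, a minimal sketch along these lines may help. The helper name explain_parse_rejection and the sample limit of 1000 bytes are made up for illustration; in real code you would pass the library's MAX_FILE_SIZE constant as the limit. It also covers the case where curl_exec() itself failed and returned false.

```php
<?php
// Hypothetical helper, not part of simple_html_dom: it mirrors the guard in
// str_get_html() and reports which condition would make the call return false.
function explain_parse_rejection($html, $maxSize)
{
    if ($html === false) {
        return 'curl_exec() failed, so there is nothing to parse';
    }
    if (empty($html)) {
        return 'the string is empty';
    }
    if (strlen($html) > $maxSize) {
        return sprintf(
            'the string is %d bytes, over the %d byte limit',
            strlen($html),
            $maxSize
        );
    }
    return null; // the guard would let this string through
}

// Stand-in for the curl response; the limit is an assumed sample value.
$limit = 1000;
$html_string = '<html><body>' . str_repeat('<p>row</p>', 200) . '</body></html>';

$reason = explain_parse_rejection($html_string, $limit);
echo $reason === null ? "size check passed\n" : "rejected: $reason\n";
```

If the reported length is above the limit, raising MAX_FILE_SIZE as described above should let the string through.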
twxia
1

Did you check if the HTML is somehow encoded in a way HTML DOM PARSER doesn't expect? E.g. with HTML entities like &lt;html&gt; instead of <html> – that would still be displayed as correct HTML in your browser but wouldn't parse.
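
A rough way to test for this, assuming the whole response is entity-escaped: the sketch below decodes the string only when it contains no literal "<" but does contain "&lt;". The function name decode_if_escaped is hypothetical; html_entity_decode() is the standard PHP function.

```php
<?php
// Hypothetical helper: undo entity-escaping only when the markup looks
// fully escaped (no literal "<" anywhere, but at least one "&lt;").
function decode_if_escaped($html)
{
    $hasRealTags = strpos($html, '<') !== false;
    $hasEscapedTags = strpos($html, '&lt;') !== false;

    if (!$hasRealTags && $hasEscapedTags) {
        return html_entity_decode($html, ENT_QUOTES, 'UTF-8');
    }
    return $html; // already real markup, leave it untouched
}

$sample = '&lt;html&gt;&lt;body&gt;hi&lt;/body&gt;&lt;/html&gt;';
echo decode_if_escaped($sample), "\n";
```

A string that already contains real tags is returned unchanged, so entities that legitimately appear inside the page text are not touched.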

florian h
  • I saved the string to a file and looked at it with Notepad. The tags (and the entire HTML) look perfectly valid. – Dani Jan 05 '13 at 14:33
0

I assume that you are using curl + str_get_html, instead of simply calling file_get_html with the URL, because of the POST parameters you need to send.

You can use this W3C validator (http://validator.w3.org/#validate_by_input+with_options) to validate the returned HTML. Then, once you are sure the result is 100% valid HTML, you can report a bug here: http://sourceforge.net/p/simplehtmldom/bugs/.
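
If you prefer a quick local check before pasting into the validator, one option is PHP's bundled DOM extension. This is a different parser from both the W3C validator and simple_html_dom, so treat its messages as hints only. The helper name collect_markup_errors is made up for this sketch.

```php
<?php
// Hypothetical helper: returns the markup problems libxml reports for a string.
// Requires PHP's DOM extension, which is a separate parser from simple_html_dom.
function collect_markup_errors($html)
{
    // Buffer libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $messages = array();
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting
    return $messages;
}

foreach (collect_markup_errors('<div><p>unclosed</span></div>') as $message) {
    echo $message, "\n";
}
```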

FerCa
  • Well, I used the validator and received errors for the returned HTML. The funny thing is that when I take the page source from a web browser and validate that, I receive errors as well. So unfortunately that doesn't help me. If the returned HTML page displays properly when I echo it, isn't that supposed to be enough? – Dani Jan 05 '13 at 15:33
  • Well, this means that the page you are trying to parse is not valid HTML. What are the errors, BTW? Anyway, you can try to report a bug to the PHP HTML DOM parser project, but if the HTML you are trying to parse is not really valid, I'm not sure you will get this fixed. – FerCa Jan 05 '13 at 22:38