6

I am fetching some info via PHP from a webpage using simple_html_dom and cURL. The problem is that the page's HTML is malformed, so the parsed DOM object contains erroneous info.

How can I get the HTML file as a string in a PHP var so that I can run a regular expression through it?

cURL doesn't work because it ignores the malformed part.
simple_html_dom.php has the same issue.
wget doesn't work since I don't have permissions for it on the server.

GEOCHET
fmsf

3 Answers

13

file_get_contents — Reads entire file into a string

string file_get_contents ( 
    string $filename [, int $flags= 0 [, resource $context [, int $offset= -1 [, int $maxlen= -1 ]]]] 
)

From the manual:

This function is similar to file(), except that file_get_contents() returns the file in a string, starting at the specified offset up to maxlen bytes. On failure, file_get_contents() will return FALSE.

file_get_contents() is the preferred way to read the contents of a file into a string. It will use memory mapping techniques if supported by your OS to enhance performance.

It works with both web pages and files. You can grab the HTML just by passing "http://whatever.com/page.html" as $filename.
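As a rough sketch of how that could look for the question's use case (raw HTML in a string, then a regular expression over it): the helper names `fetch_raw_html` and `extract_title`, the user-agent string, and the 10-second timeout are all assumptions made up for illustration, not part of the manual or of any library.

```php
<?php
// Hypothetical helper: fetch a URL and return the raw body as a string,
// or false on failure. Nothing is parsed, so malformed markup is kept as-is.
function fetch_raw_html(string $url, string $userAgent = 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)')
{
    // file_get_contents() can only open URLs when allow_url_fopen is enabled.
    if (!filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        return false;
    }

    // Stream context for the HTTP wrapper; the user agent and timeout
    // are illustrative values, not requirements.
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'timeout'    => 10,
        ],
    ]);

    // @ suppresses the warning on failure; the return value is false then.
    return @file_get_contents($url, false, $context);
}

// Hypothetical helper: pull the <title> text out of raw HTML with a regex.
// Returns null when no <title>...</title> pair is found.
function extract_title(string $html): ?string
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim($m[1]);
    }
    return null;
}
```

Usage would be something like `$html = fetch_raw_html('http://whatever.com/page.html');` followed by `extract_title($html)` once you have checked that `$html !== false`. Swap the title pattern for whatever you actually need to match.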

Joey
Gerrit
  • 2
    Only works if allow_url_fopen is enabled, though. There's really no reason that this should work with curl as well. – Emil H Jul 29 '09 at 22:58
  • 1
    It also ignores part of the file :S The only one so far that really gets the file correctly is wget, which I can't use :S – fmsf Jul 29 '09 at 23:05
  • You can test it with this URL: http://www.zonlusomundo.pt/txt_geral.php?Gid=1579830&zona=filme Compare the size of the wget output with the size you get from that. – fmsf Jul 29 '09 at 23:06
  • Maybe the site has some kind of user-agent policy. How does wget identify itself to the site? If you pull google.com, it (usually) loads perfectly with both file_get_contents and wget. Maybe you need to set a user-agent string for file_get_contents() – Gerrit Aug 07 '09 at 15:55
4

With cURL, make sure you set the CURLOPT_RETURNTRANSFER option so that the page is returned as a string; otherwise curl_exec() prints the response directly instead of returning it, e.g.:

    //return the transfer as a string 
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 

See http://www.php.net/manual/en/function.curl-setopt.php
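For context, here is a minimal sketch of a complete request built around that option. The function names `build_curl_options` and `curl_fetch_string`, the user-agent string, and the timeout are assumptions for illustration; only the `CURLOPT_*` constants and `curl_*` calls come from the cURL extension itself.

```php
<?php
// Hypothetical helper: collect the cURL options in one place.
function build_curl_options(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        // Return the response body from curl_exec() instead of printing it.
        CURLOPT_RETURNTRANSFER => true,
        // Follow HTTP redirects.
        CURLOPT_FOLLOWLOCATION => true,
        // Illustrative user agent, in case the site filters on it.
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)',
        // Empty string: accept every encoding this cURL build supports.
        CURLOPT_ENCODING       => '',
        CURLOPT_TIMEOUT        => 10,
    ];
}

// Hypothetical helper: fetch a URL and return the body as a string,
// or false if the transfer failed.
function curl_fetch_string(string $url)
{
    $ch = curl_init();
    curl_setopt_array($ch, build_curl_options($url));

    $body = curl_exec($ch);
    if ($body === false) {
        // Log the reason so a failed transfer is not silently ignored.
        error_log('cURL error: ' . curl_error($ch));
    }
    return $body;
}
```

The string returned by `curl_fetch_string()` is the unparsed response, so you can run `preg_match()` over it directly.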

karim79
0

I used cURL to get the file into a string (simple_html_dom::load_file just wraps file_get_contents), then used simple_html_dom's load (from string) method to parse it. That works for some URLs, but it fails in this case, where the URL has a parameter string: the page is fetched as if the URL had no parameter string. I set a user agent with cURL to impersonate a browser, but no dice.

Sorry, this is not really an answer, but maybe using cURL will work for some people for whom the allow_url_fopen setting is a problem.
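One way to narrow down the parameter-string problem might be to build the URL explicitly and then ask cURL which URL it actually ended up requesting. This is only a diagnostic sketch: `build_url` and `fetch_with_effective_url` are hypothetical names, and it assumes the parameters are being lost somewhere between your code and the final request (for example on a redirect), which may not be what is happening here.

```php
<?php
// Hypothetical helper: append query parameters to a base URL.
function build_url(string $base, array $params): string
{
    if (empty($params)) {
        return $base;
    }
    // Use '&' if the base already has a query string, otherwise '?'.
    $separator = (strpos($base, '?') === false) ? '?' : '&';
    // Force a plain '&' between pairs regardless of arg_separator.output.
    return $base . $separator . http_build_query($params, '', '&');
}

// Hypothetical helper: fetch a URL and also report the last URL cURL used,
// so you can see whether the query string survived.
function fetch_with_effective_url(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    $body = curl_exec($ch);
    $effectiveUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

    return ['body' => $body, 'effective_url' => $effectiveUrl];
}
```

If `effective_url` comes back without the parameters, the request was redirected or rewritten along the way; if it still has them, the server is probably responding differently for some other reason (cookies, referrer, and so on).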

Colleen Kitchen