
I've been writing a PHP crawler that needs to collect all the links from a site and request each of them (instead of clicking them manually or doing it with client-side JS).

I have read these:

  1. How do I make a simple crawler in PHP?
  2. How do you parse and process HTML/XML in PHP?

and several others, and I decided to follow the first one.

So far it has been working, but I am baffled by the difference between using file_get_contents and dom->loadHTMLFile. Could you explain how the two approaches differ and what the implications are: the pros and cons, or a simple comparison of when to use each?

Rad Apdal
  • `file_get_contents` just gets you all the html that the target webpage contains, it knows no DOM. For DOM manipulation you still have to use the DOM related classes even if you get content via `file_get_contents` – Hanky Panky Jul 07 '14 at 16:02

1 Answer

Effectively, both methods do the same thing. However, with file_get_contents() you need to store the result, at least temporarily, in a string variable before you pass it to DOMDocument::loadHTML(), whereas loadHTMLFile() reads the source directly. The extra string leads to higher memory usage in your application.
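To make the comparison concrete, here is a minimal sketch that loads the same markup both ways and extracts the links. The sample HTML, the temporary file and the `extract_links()` helper are assumptions for illustration only; a local file stands in for a remote URL so the snippet runs offline.

```php
<?php
// Hypothetical helper: collect the href values of all <a> elements.
function extract_links(DOMDocument $doc): array
{
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

// Collect parser warnings internally instead of printing them;
// real-world HTML is rarely well formed.
libxml_use_internal_errors(true);

// Sample page (made up), written to a temp file that stands in for a URL.
$html = '<html><body><a href="/a">A</a><a href="/b">B</a></body></html>';
$path = tempnam(sys_get_temp_dir(), 'crawl');
file_put_contents($path, $html);

// Approach 1: read the markup into a string first, then parse it.
$markup = file_get_contents($path);
$doc1 = new DOMDocument();
$doc1->loadHTML($markup);
$links1 = extract_links($doc1);

// Approach 2: let DOMDocument read the source itself.
$doc2 = new DOMDocument();
$doc2->loadHTMLFile($path);
$links2 = extract_links($doc2);

unlink($path);
```

Both variables should end up holding the same list of links; the only difference is that the first approach keeps the raw markup around in `$markup`.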


Some sites may require you to set special header values, or to use an HTTP method other than GET. If you need this, you have to specify a so-called stream context. You can do this for both of the above methods using stream_context_create():

Example:

$opts = array(
  'http'=>array(
    'method'=>"GET",
    'header'=>"Accept-language: en\r\n" .
              "Cookie: foo=bar\r\n"
  )
);

$ctx = stream_context_create($opts);

You can set this context using both of the above ways, but they differ in how to achieve this:

// With file_get_contents ...
$html = file_get_contents($url, false, $ctx);

// With DOM
libxml_set_streams_context($ctx);

$doc = new DOMDocument();
$doc->loadHTMLFile($url);    
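The context above uses GET. For a different HTTP method, a sketch of a POST context might look like the following; the form field `q` and its value are made up for illustration, and the target URL is left out on purpose.

```php
<?php
// Hypothetical form data; replace with whatever the target site expects.
$body = http_build_query(['q' => 'php crawler']);

$postCtx = stream_context_create([
    'http' => [
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => $body,
    ],
]);

// Inspect what the context will send, without making a request.
$opts = stream_context_get_options($postCtx);
```

The resulting `$postCtx` can be used in the same two ways shown above: as the third argument of file_get_contents(), or via libxml_set_streams_context() before calling loadHTMLFile().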

It remains to be said that with the curl extension you have even more control over the HTTP transfer, which might be necessary in some special cases.
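As a rough sketch of the curl route, assuming the curl extension is installed: the helper name `fetch_with_curl()`, the timeout and the example URL are my own choices, not something the question prescribes.

```php
<?php
// Hypothetical helper: fetch a URL with curl and return the body,
// or null if the transfer failed.
function fetch_with_curl(string $url, array $headers = []): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow HTTP redirects
        CURLOPT_TIMEOUT        => 10,     // give up after 10 seconds
        CURLOPT_HTTPHEADER     => $headers,
    ]);
    $body = curl_exec($ch);

    return $body === false ? null : $body;
}

// Example call (hypothetical URL); uncomment to run against a real site.
// $html = fetch_with_curl('http://example.com/', ['Accept-Language: en', 'Cookie: foo=bar']);
```

The returned string can then be handed to DOMDocument::loadHTML(), just like the result of file_get_contents().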

hek2mgl