
I have a website scraping project. Look at this code:

<?php
include('db.php');

$r    = mysql_query("SELECT * FROM urltable");
$rows = mysql_num_rows($r);

for ($j = 0; $j < $rows; ++$j) {
    // Download the HTML of each URL and save it to a numbered text file.
    $row  = mysql_fetch_row($r);
    $html = file_get_contents(mysql_result($r, $j, 'url'));

    $file = fopen($j . ".txt", "w");
    fwrite($file, $html);
    fclose($file);
}
?>

I have a list of URLs stored in the database. This code fetches the HTML contents of each URL and writes it to a separate text file.

When I run this code, it creates only about one file per second (each file is ~20 KB). My connection provides 3 Mbps download speed, but this code comes nowhere near using it.

How do I speed up file_get_contents()? Or how do I speed up this code in some other way, for example with threading, a php.ini setting, or any other method?

  • Hi, for future reference, always remember to Google first! Searching for `php speed up file_get_contents()` returns a bunch of useful results, including many here on Stack Overflow. – Pekka Jul 13 '13 at 08:48
  • possible duplicate of [PHP file_get_contents very slow when using full url](http://stackoverflow.com/questions/3629504/php-file-get-contents-very-slow-when-using-full-url) – Pekka Jul 13 '13 at 08:48
  • 1 sec limit may be related to DNS and thus unsolvable. See http://stackoverflow.com/q/7987584/258674 – dev-null-dweller Jul 13 '13 at 08:53

2 Answers


As this was not one of the suggestions on the duplicate page, I will add it here.

Take a close look at the curl_multi section of the PHP manual.

It's not totally straightforward, but once you get it running it is very fast. Basically, you issue multiple cURL requests at the same time and then collect the data as and when each request returns. The responses come back in no particular order, so a bit of bookkeeping is required. I have used this on a data-collection process to cut 3-4 hours of processing down to 30 minutes.

The only issue could be that you swamp a site with many simultaneous requests, the owner considers that abuse, and they ban your access. But with a bit of sensible sleep()-ing added to your process, you should be able to reduce that possibility to a minimum.
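
To illustrate the idea, here is a minimal sketch (not my production code). It assumes the URLs have already been loaded into an array $urls, for example from the urltable in the question, and it reuses the question's numbered-file naming.

// Minimal curl_multi sketch: fetch all URLs in parallel, then write each
// response to its own numbered file, as in the question.
$urls    = array(/* ... loaded from urltable ... */);  // assumption: URLs already in an array
$mh      = curl_multi_init();
$handles = array();

foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);    // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);    // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);             // per-request timeout in seconds
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// Drive all transfers concurrently until every one has finished.
do {
    curl_multi_exec($mh, $running);
    if (curl_multi_select($mh) === -1) {
        usleep(100000);                                // avoid a busy loop if select() fails
    }
} while ($running > 0);

// Collect the results; they completed in whatever order the servers responded.
foreach ($handles as $i => $ch) {
    file_put_contents($i . ".txt", curl_multi_getcontent($ch));
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

For politeness you would normally split $urls into smaller batches and sleep() between batches rather than firing every request at once.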

RiggsFolly
  • Not overloading any one site is the main issue here. If multiple domains are involved, the requests to them should be interleaved, e.g. do not walk through all of the first domain, then the second, and so on, but mix them. That will also improve performance. – Sven Jul 13 '13 at 11:24

You can add a few controls with a stream context, but cURL should perform much better if it is available.

$stream_options = array(
    'http' => array(
        'method'        => 'GET',
        'header'        => 'Accept-language: en',
        'timeout'       => 30,      // give up on a slow host after 30 seconds
        'ignore_errors' => true,    // still return the body on HTTP error codes
    ),
);
$stream_context = stream_context_create($stream_options);
$fc = file_get_contents($url, false, $stream_context);
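
For completeness, a sketch of how the context could be plugged into the loop from the question; the urltable and url names come from the question, everything else is illustrative. The 'timeout' option is the main speed-related control: a dead or very slow host gives up after 30 seconds instead of blocking the loop for the full default_socket_timeout (60 seconds by default).

// Sketch only: reuse one stream context for every URL from the question's loop.
$r    = mysql_query("SELECT * FROM urltable");
$rows = mysql_num_rows($r);

for ($j = 0; $j < $rows; ++$j) {
    $url  = mysql_result($r, $j, 'url');
    $html = file_get_contents($url, false, $stream_context);

    if ($html !== false) {              // skip URLs that failed or timed out
        file_put_contents($j . ".txt", $html);
    }
}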
Bimal Poudel