
I'm trying to download multiple files that reside in a folder and have sequential names, i.e. 1.html, 2.html, 3.html, ..., 9999.html.

What is the best way to read and process these HTML files using PHP?

[The files will also be parsed with DOMXPath!]

Below is the code of the UI:

<html lang="en">
<head>
<meta charset="utf-8"/>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript">
$(document).ready(function() {
    $('#btn').click(function() {
        // reset the processed / not-found counters
        $('#p3').val(0);
        $('#p2').val(0);
        $('#p1').val(parseInt($('#st').val()));
        // fire one POST request per file id in the requested range
        for (var i = parseInt($('#st').val()); i < parseInt($('#en').val()); i++) {
            $.post("downloader.php", { 'id': i })
                .always(function(data) {
                    // count the file as processed or not found, and log the response
                    if (data != 0)
                        $('#p2').val(parseInt($('#p2').val()) + 1);
                    else
                        $('#p3').val(parseInt($('#p3').val()) + 1);
                    $("#txt").val($("#txt").val() + "\n" + data);
                    $('#p1').val(parseInt($('#p1').val()) + 1);
                });
        }
    });
});
</script>
</head>
<body>
<form name="frm" id="frm">
Start from <input type="text" name="st" id="st" /> To <input type="text" name="en" id="en" /> <hr/>
Processing <input type="text" name="p1" id="p1" /> <br/>
Processed <input type="text" name="p2" id="p2" /> <br/>
Not found <input type="text" name="p3" id="p3" /> <br/>
<input type="button" id="btn" value="Start" />
</form>
<textarea id="txt" name="txt"></textarea>
</body>
</html>

The background crawler (downloader.php):

<?php
error_reporting(0);
$id = intval($_POST['id']) + 1;
$url = 'https://remote.server/'.$id.'.html';

//$html = curl_get_contents($url);

// report a miss and stop if the file cannot be fetched
if (!$html = @file_get_contents($url)) {
    echo 0;
    exit;
}

// some processing of the data
$data = (new DOMXPath(@DOMDocument::loadHTML($html)))->query('//span[@class="data"]')->item(1)->textContent;

$data2 = (new DOMXPath(@DOMDocument::loadHTML($html)))->query('//span[@class="data2"]')->item(0)->textContent;

/* insertion of data
$dba_host = 'p:localhost'; $dba_name = 'root'; $dba_pass = ''; $dba_db = 'db';
$con = mysqli_connect($dba_host, $dba_name, $dba_pass, $dba_db) or die('Connection Refused !');
$stmt = mysqli_prepare($con, "INSERT INTO `tbl` (*, *) VALUES (?, ?)");
mysqli_stmt_bind_param($stmt, "ss", *, *);
mysqli_stmt_execute($stmt);
mysqli_stmt_close($stmt);
mysqli_close($con);
*/

function curl_get_contents($url) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; W…) Gecko/20100101 Firefox/61.0');
    $return = curl_exec($curl);
    curl_close($curl);
    return $return;
}

echo 1;
?>

As of now, the performance is really slow. How can I improve/optimize the code?

Sourav
    You can make use of [`curl_multi_exec()`](http://php.net/manual/en/function.curl-multi-exec.php) to download multiple files at once. – MonkeyZeus Dec 20 '18 at 18:11
    1) Put the loop on the server side so there's only one AJAX hit. 2) Pre-process each file and store the results so you don't have to re-parse them for every request. – Alex Howansky Dec 20 '18 at 18:11
  • Possible duplicate of [understanding php curl\_multi\_exec](https://stackoverflow.com/questions/15559157/understanding-php-curl-multi-exec) – MonkeyZeus Dec 20 '18 at 18:12
  • As @MonkeyZeus pointed out, using multiple concurrent connections with curl_multi, plus compression, should massively speed up this script. – hanshenrik Dec 21 '18 at 00:14
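
A minimal sketch of Alex Howansky's first suggestion, moving the loop into downloader.php so the browser makes a single request for the whole range (the st/en parameters mirror the existing form fields; the JSON summary shape is my own choice for illustration):

<?php
// hypothetical reworked downloader.php: handle the whole st..en range
// in one request instead of one AJAX hit per file
$start = intval($_POST['st']);
$end   = intval($_POST['en']);
$found = 0;
$missing = 0;

for ($i = $start; $i < $end; $i++) {
    $html = @file_get_contents('https://remote.server/' . ($i + 1) . '.html');
    if ($html === false) {
        $missing++;
        continue;
    }
    // ... parse $html and insert into the database as in the original script ...
    $found++;
}

// one summary response for the UI to display
echo json_encode(['found' => $found, 'missing' => $missing]);

The UI would then issue a single $.post with the st/en values and render the returned summary, instead of firing thousands of requests.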

1 Answer

  1. Use the curl_multi API to download the pages in parallel; that should speed up the downloads by a huge margin. You can find an example of using curl_multi here, and there is a sketch after this list.

  2. Use compression for the transfer: .html files compress very well, so this should yield a significant performance improvement as well. To enable compressed transfers, just set CURLOPT_ENCODING to an empty string, e.g. curl_setopt($ch, CURLOPT_ENCODING, "");, and curl will use compression for the transfer (this is also shown in the sketch after this list).

  3. You can micro-optimize CPU usage by creating the DOMDocument and DOMXPath objects only once and re-using them; building them from big HTML source code takes CPU time, and your code creates them twice for no good reason. Specifically, this would be faster and use less CPU:

$domd = @DOMDocument::loadHTML($html);
$xp = new DOMXPath($domd);
$data = $xp->query('//span[@class="data"]')->item(1)->textContent;
$data2 = $xp->query('//span[@class="data2"]')->item(0)->textContent;

  4. If the pages can be cached, keeping local cached copies of them, perhaps combined with an updating daemon or cronjob, should yield a greater performance gain than the other three optimizations combined. You can find out how to create local cached copies of data here; a sketch follows below. As for how to create daemons or cronjobs, that is OS-specific: on Unix-like systems (Linux, BSD, macOS) you'd typically use cron; on Windows you'd typically use the at command or Task Scheduler.
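
As a sketch of points 1 and 2 combined (the URL pattern comes from the question; the batch size of 50 and the error handling are assumptions, and for 9999 files you'd run several such batches rather than adding all handles at once):

<?php
// download a batch of pages in parallel with the curl_multi API,
// with compressed transfer enabled via CURLOPT_ENCODING
$mh = curl_multi_init();
$handles = [];
for ($id = 1; $id <= 50; $id++) {
    $ch = curl_init('https://remote.server/' . $id . '.html');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_ENCODING, ''); // accept any compression curl supports
    curl_multi_add_handle($mh, $ch);
    $handles[$id] = $ch;
}

// drive all transfers until none are still running
do {
    $status = curl_multi_exec($mh, $active);
    if ($active) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($active && $status == CURLM_OK);

foreach ($handles as $id => $ch) {
    if (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200) {
        $html = curl_multi_getcontent($ch);
        // ... parse $html with the shared DOMDocument/DOMXPath approach above ...
    }
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);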
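
And a minimal sketch of point 4's local cache (the cache directory and the 24-hour lifetime are arbitrary assumptions for illustration):

<?php
// serve a locally cached copy when it is fresh, otherwise re-download
$id    = intval($_POST['id']) + 1;
$cache = __DIR__ . '/cache/' . $id . '.html';

if (is_file($cache) && time() - filemtime($cache) < 86400) {
    $html = file_get_contents($cache); // cache hit
} else {
    $html = @file_get_contents('https://remote.server/' . $id . '.html');
    if ($html !== false) {
        file_put_contents($cache, $html); // store the copy for next time
    }
}

A cronjob could refresh the cache directory on a schedule so requests never pay the download cost at all.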
hanshenrik
  • @Sourav Well, you can find a curl_multi example [here](http://php.net/manual/en/function.curl-multi-exec.php), and documentation on using CURLOPT_ENCODING [here](http://php.net/manual/en/function.curl-setopt.php) (though you just have to set it to an empty string to use compression, literally `curl_setopt($ch,CURLOPT_ENCODING,"");`, done), and [here](http://php.net/manual/en/function.file-put-contents.php) you can find out how to create local cached copies of data. As for re-using DOMDocument & DOMXPath, assign them to variables; see the updated post on DOMDocument/DOMXPath. – hanshenrik Dec 21 '18 at 08:20
  • Your code helped, but the DOM part does not work. Also, I'm unable to find any proper example of curl_multi_exec. – Sourav Dec 23 '18 at 16:24
  • @Sourav The DOM part didn't work? What HTML are you trying to parse? – hanshenrik Dec 23 '18 at 17:28