
I want to parse values from a website category with paginated posts; the information I need is inside the posts themselves. I tried to do this with Simple HTML DOM and got it working, but I don't think my approach is correct. The script runs slowly, and with a large amount of data I get the error

Maximum execution time of 300 seconds exceeded

    <?php

    include('simple_html_dom.php');

    $total_pages = 600;
    $from        = 1;               // current page number; $url is the category base URL

    $fp = fopen('file.csv', 'w');   // open the CSV once instead of rewriting it on every page

    while ($from <= $total_pages):

        // load one page of the category listing
        $html = file_get_html($url . '/' . $from);

        foreach ($html->find('.itemReview h3 a') as $a) {

            // load the post linked from the listing
            $post = file_get_html('http://www.website.com/' . $a->href);

            // take the text of the matched element, not of the whole post
            $author_mail = $post->find('.sellerAreaSecond', 0);
            fputcsv($fp, array($author_mail ? $author_mail->plaintext : ''));

            // free the memory held by the parsed post
            $post->clear();
            unset($post);
        }

        $html->clear();
        unset($html);

        $from++;
    endwhile;

    fclose($fp);
    ?>
user3514052
  • You should increase the execution time limit with the set_time_limit() function: http://php.net/manual/en/function.set-time-limit.php. Are you executing it from the command line? – Andreas Nov 16 '16 at 18:14
  • I run it from the browser. Your suggestion helped; now I can let the script run longer. But if I try to parse more than 100 pages I get a 500 server error :( – user3514052 Nov 17 '16 at 17:39
  • You might want to increase the allowed memory limit in your script: ini_set('memory_limit', '-1'); for unlimited memory. You might also want to read http://stackoverflow.com/questions/11885191/how-to-increase-memory-limit-for-php-over-2gb – Andreas Nov 17 '16 at 18:05
  • I updated my first post: I've added code that records the results in a CSV file, which slows the script down even further. Do you have any suggestions for simplifying it? – user3514052 Nov 17 '16 at 19:18

2 Answers


Since you are fetching the category pages and every post inside them over the network, the script is naturally slow, and with large amounts of data it runs into the script timeout. Try increasing the maximum execution time in your php.ini file.
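
For example, a minimal sketch of the runtime equivalents of the settings suggested here and in the comments above; the values are placeholders and assume your host lets you override them:

    <?php
    // Runtime equivalents of the php.ini directives max_execution_time and memory_limit.
    set_time_limit(0);              // 0 = no execution time limit for this script
    ini_set('memory_limit', '-1');  // -1 = no memory limit (prefer a concrete cap in production)

    // ... the scraping loop from the question goes here ...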

Hokascha

One solution would be to increase the time limit in your server settings (php.ini).

A better one would be not to have your script download and parse 100 pages on every run. Parsing HTML takes a lot of time: the parser has to walk through all of the markup to find your .itemReview h3 a and .sellerAreaSecond elements. I suspect you're using plain files for data storage; if that's the case, you should switch to a database like MySQL or even SQLite, and then simply query the database, which takes considerably less time. This not only keeps your site from crashing when there is more content, but also speeds it up.

With SQL, you could store each author's email in a table, run SELECT authoremail FROM posts, and loop over the result with foreach(). This also lets you do things like sorting by date, name, etc. on the fly. Just letting your site run slow and inefficient by increasing the time limit is probably not a good idea.
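
A rough sketch of that idea, assuming SQLite via PDO and a posts table with an authoremail column (the table and file names are placeholders, not part of the original code):

    <?php
    // Hypothetical storage for scraped results: fill it once, query it afterwards.
    $db = new PDO('sqlite:' . __DIR__ . '/posts.sqlite');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->exec('CREATE TABLE IF NOT EXISTS posts (id INTEGER PRIMARY KEY, authoremail TEXT)');

    // Inside the scraper: store each email as it is found.
    $insert = $db->prepare('INSERT INTO posts (authoremail) VALUES (:mail)');
    $insert->execute([':mail' => 'seller@example.com']);   // placeholder value

    // Later: read the stored data instead of re-parsing the HTML every time.
    foreach ($db->query('SELECT authoremail FROM posts') as $row) {
        echo $row['authoremail'], PHP_EOL;
    }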

Qrchack