
I am using simple_html_dom to scrape website pages. The problem is that when I want to scrape many pages, for example 500 URLs, it takes a long time (5-30 minutes) to complete, and that makes my server return error 500.

Some of the things I have already tried:

  1. using set_time_limit()
  2. setting ini_set('max_execution_time')
  3. adding a delay() between requests

I have read on Stack Overflow many times that a cron job should be used to split long-running PHP scripts. My question is: how do I split a long-running PHP script? Can you give me the best way to split it, with a step-by-step script, because I am a beginner?
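
For reference, this is the kind of crontab entry I have seen suggested for that. I have not set it up yet; the PHP path, the script name scrape_batch.php and the 5 minute interval are just placeholders, not my real setup:

# placeholder crontab entry: run the batch script every 5 minutes
*/5 * * * * /usr/bin/php /path/to/scrape_batch.php >> /path/to/scrape.log 2>&1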

About my program: I have two files. File 1 contains an array of more than 500 URLs; file 2 contains the function that does the actual scraping.

For example, this is file 1:

set_time_limit(0);
ini_set('max_execution_time', 3000); // 3000 seconds = 50 minutes
$start = microtime(true); // start measuring page render time
error_reporting(E_ALL);
ini_set('display_errors', 1);
include ("simple_html_dom.php");
include ("scrape.php");

$link = array('url1','url2','url3'...);
array_chunk($link, 25); // here I tried to split into chunks of 25, but it is not working (see the note below this code)
$hasilScrape = array();
for ($i = 1; $i <= count($link); $i++) {
    // this is where I call the get_data() function to scrape each URL
    $hasilScrape[$i-1] = json_decode(get_data($link[$i-1]), true);
}

$filename = 'File_Hasil_Scrape';
$fp = fopen($filename . ".csv", 'w');
foreach ($hasilScrape as $fields) {
    fputcsv($fp, $fields);
}
fclose($fp);
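
As far as I understand, array_chunk() does not modify $link in place; it returns a new array of chunks that has to be assigned to a variable and looped over. This is a minimal sketch of how I think the chunked loop should look (it still runs everything in one request, so it does not fix the timeout by itself; get_data() is the function from my file 2):

$chunks = array_chunk($link, 25); // array of arrays, 25 URLs per chunk
$hasilScrape = array();
foreach ($chunks as $chunkIndex => $chunk) {
    foreach ($chunk as $url) {
        $hasilScrape[] = json_decode(get_data($url), true);
    }
    // print some progress so the server/browser keeps receiving output
    echo "finished chunk " . ($chunkIndex + 1) . " of " . count($chunks) . "\n";
    flush();
}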

I have been thinking: can I split the link array into chunks of 25, and then pause or temporarily stop the process (NOT a delay, because I have already tried that and it was useless) and run it again? Can you tell me how, please? Thank you so much.
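
This is roughly what I imagine the split version could look like. It is only a sketch: the links.txt file, the progress.txt file and the batch size of 25 are assumptions on my side, and get_data() is the function from my file 2. Each run processes one batch of 25 URLs, appends the results to the CSV, saves how far it got, and exits, so something like the cron entry above could keep calling it until the whole list is done:

set_time_limit(0);
include ("simple_html_dom.php");
include ("scrape.php");

$batchSize    = 25;             // assumed batch size
$links        = file('links.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES); // assumed links file
$progressFile = 'progress.txt'; // assumed progress file

// how many links were already processed by earlier runs
$offset = file_exists($progressFile) ? (int) file_get_contents($progressFile) : 0;

if ($offset >= count($links)) {
    echo "all links processed\n";
    exit;
}

$batch = array_slice($links, $offset, $batchSize);
$fp = fopen('File_Hasil_Scrape.csv', 'a'); // append so earlier batches are kept

foreach ($batch as $url) {
    $row = json_decode(get_data($url), true); // get_data() comes from scrape.php (file 2)
    if (is_array($row)) {
        fputcsv($fp, $row);
    }
}

fclose($fp);

// remember where the next run should continue
file_put_contents($progressFile, $offset + count($batch));
echo "processed links " . ($offset + 1) . " to " . ($offset + count($batch)) . "\n";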

CSM Media
  • Where does your array of links come from? – Martin Nov 19 '17 at 10:50
  • 3000 !== 30 minutes; `3000 seconds == 50 minutes`. – Martin Nov 19 '17 at 10:52
  • if you get a server error 500, what is your [***PHP*** error](https://stackoverflow.com/questions/845021/how-to-get-useful-error-messages-in-php)? – Martin Nov 19 '17 at 10:52
  • Queues and sub workers – Scuzzy Nov 19 '17 at 10:57
  • 'url1','url2','url3'... from file 1, Mr @Martin – CSM Media Nov 19 '17 at 11:05
  • Yes..... but how does file 1 get this data? Where does the data come from? Is it a JSON import, is it an opened listing file, are they manually typed into the hardcode, etc.? – Martin Nov 19 '17 at 11:06
  • @Scuzzy what do you mean? I don't understand, please be more specific – CSM Media Nov 19 '17 at 11:08
  • You've got a lot of work to do; you need to maintain a queue of tasks that can be completed by many sub-processes chipping away at small parts of the work. You're going to have to look beyond just a single script execution, or use repeating worker tasks with cron etc. – Scuzzy Nov 19 '17 at 11:10
  • @Martin from the get_data function in file 2; I just call the function and get the data from file 2. I don't know what you mean by manually typed into the hardcode – CSM Media Nov 19 '17 at 11:10
  • `$link=array('url1','url2','url3'...);` This is manually hardcoded. Typed in. – Martin Nov 19 '17 at 11:12
  • @Scuzzy I believe there is a solution in small steps, but I don't know how to do it. My code runs fine without errors, it is just too slow. – CSM Media Nov 19 '17 at 11:12
  • @Martin no, I import the links from a text file and convert them to an array – CSM Media Nov 19 '17 at 11:13
  • If your data list comes from a file full of URLs then you need to open the file and import the URLs in manageable block sizes (*X number at a time*) and keep a record of what number you reach. Use something like a `session` counter to see how many lines down the `file2` file you're reaching, until it's complete (see the sketch below the comments) – Martin Nov 19 '17 at 11:13
  • Yes I was simply showing you what hardcoded was `:-p` – Martin Nov 19 '17 at 11:14
  • @Martin yes, I think I am using hardcoding because I am a beginner :D – CSM Media Nov 19 '17 at 11:16
  • @Martin I don't understand how to use a session hehe... – CSM Media Nov 19 '17 at 11:18
  • You need to make use of Google and read a lot about programming with PHP – Martin Nov 19 '17 at 11:21
  • You just said you're **not** using hardcoded because the data is coming from file2, please make your mind up! – Martin Nov 19 '17 at 11:21
  • My error is: Request Timeout. This request takes too long to process, it is timed out by the server. If it should not be timed out, please contact the administrator of this web site to increase 'Connection Timeout'. – CSM Media Nov 19 '17 at 11:50
  • Solved: the error was because the process ran too long without producing any output, so I added an echo inside the loop. – CSM Media Nov 20 '17 at 08:16
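
Based on Martin's comment about importing the URLs in blocks and keeping a `session` counter, this is how I understand the idea. It is only a minimal sketch: links.txt, the block size of 25 and the meta-refresh reload are my assumptions, and get_data() is the function from file 2. The page would have to be opened in a browser and left to reload itself until the whole list is processed:

session_start();
include ("simple_html_dom.php");
include ("scrape.php");

$links  = file('links.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES); // assumed links file
$offset = isset($_SESSION['scrape_offset']) ? $_SESSION['scrape_offset'] : 0; // session counter

if ($offset >= count($links)) {
    echo "all links processed";
    exit;
}

$fp = fopen('File_Hasil_Scrape.csv', 'a'); // append so earlier blocks are kept
foreach (array_slice($links, $offset, 25) as $url) {
    $row = json_decode(get_data($url), true);
    if (is_array($row)) {
        fputcsv($fp, $row);
    }
}
fclose($fp);

$_SESSION['scrape_offset'] = $offset + 25;

// reload the page after a couple of seconds to process the next block
echo '<meta http-equiv="refresh" content="2">';
echo "processed up to link " . min($offset + 25, count($links));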

0 Answers