I am attempting to scrape a large amount of data (probably about 50M records) from a website. The site takes its search parameters via $_GET, so it is simply a matter of generating a list of links, each of which collects a slice of the data.
I have one script that generates the list of links on screen. The links all call the same PHP script, each passing a different search value. I then use the Chrome "LinkClump" extension to open all the links in separate tabs simultaneously (right-click and drag across the links).
I open 26 tabs at once, but the called PHP scripts do not all start: a write to a log file shows that only 6 ever run at once, and the next one will not start until one of the others has finished. Is there any way to get more than 6 running at once?
Here is the relevant snippet of the worker script that does the search. Each of the 26 invocations receives a different $value:
include_once('simple_html_dom.php');

$html = file_get_html("http://website.com/cgi-bin/Search?search=" . urlencode($value));
foreach ($html->find('table[cellpadding="3"]') as $e) {
    foreach ($e->find('tr') as $f) {
        $colval = 0;
        foreach ($f->find('td[class="output"]') as $g) {
            // ... process $g->plaintext here ...
        }
    }
}
To check whether the throttling was coming from Apache or from simple_html_dom, I wrote another tiny script that simply did a sleep(10), with a write to the log before and after. Once again only 6 would execute at once, so simple_html_dom is not the bottleneck; it must be Apache (or something between the browser and PHP).
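For reference, the sleep-and-log probe was essentially the following (a reconstruction of what the test described above would look like; the file name, log path, and the id parameter are assumptions, not from the original script):

```php
<?php
// probe.php - minimal concurrency probe: log, sleep, log again.
// Call it 26 times as probe.php?id=1 ... probe.php?id=26 and compare
// the "start" timestamps in the log to see how many ran concurrently.
$id = isset($_GET['id']) ? $_GET['id'] : 'unknown';

file_put_contents('/tmp/probe.log', date('H:i:s') . " start $id\n", FILE_APPEND);
sleep(10);  // stand-in for the real work
file_put_contents('/tmp/probe.log', date('H:i:s') . " end $id\n", FILE_APPEND);

echo "done $id";
```

If more than 6 "start" lines share the same timestamp window, the server is accepting them concurrently and the limit lies elsewhere.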
Is there an ini setting I can change in my script to force more to run at once?
I noticed this comment in another posting, Simultaneous Requests to PHP Script:
"If the requests come from the same client AND the same browser most browsers will queue the requests in this case, even when there is nothing server-side producing this behaviour."
I am running Chrome.
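If the cap really is the browser queuing requests to the same host, one way to test (and sidestep) it would be to fire all the requests from a single command-line PHP script with curl_multi, taking the browser out of the loop entirely. A minimal sketch, assuming the same URL as the worker snippet above and placeholder search values:

```php
<?php
// Fire N requests in parallel with curl_multi from the CLI,
// bypassing any per-host connection limit imposed by the browser.
$values = range('a', 'z');  // placeholder: 26 search values, one per worker

$mh = curl_multi_init();
$handles = [];
foreach ($values as $v) {
    $ch = curl_init("http://website.com/cgi-bin/Search?search=" . urlencode($v));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$v] = $ch;
}

// Run all transfers concurrently until every handle has finished.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);  // block briefly instead of busy-waiting
} while ($running > 0);

foreach ($handles as $v => $ch) {
    $body = curl_multi_getcontent($ch);
    // ... parse $body with simple_html_dom as in the worker script ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
```

If all 26 requests run concurrently here but only 6 do from Chrome, the bottleneck is the browser's per-host connection limit rather than Apache.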