
I am attempting to scrape a large amount of data from a website. (Probably about 50M records.) The website uses $_GET so it is simply a matter of generating a list of links, each one of which collects a bit of the data.

I have one script that generates a list of links on the screen. The links all call the same PHP script, each passing a different search value. I then use the Chrome "LinkClump" extension to open all the links in separate tabs simultaneously (right-click and drag across the links).

I open 26 tabs at once, but the called PHP scripts do not all start. A write to my log shows that only 6 ever run at once; the next one will not start until one of the others has finished. Is there any way to get more than 6 running at once?

Here is the relevant snippet of code in the 26 worker scripts that does the search. I simply pass a different $value to each one:

$html = file_get_html("http://website.com/cgi-bin/Search?search=$value");
foreach ($html->find('table[cellpadding="3"]') as $e) {
    foreach ($e->find('tr') as $f) {
        $colval = 0;
        foreach ($f->find('td[class="output"]') as $g) {
            // ... process each cell ($g->plaintext) and write it to the database
            $colval++;
        }
    }
}
To check whether it was Apache or simple_html_dom that was throttling the connections, I wrote another tiny script that simply did a sleep(10), with a write to the log before and after. Once again only 6 would execute at once, so I assumed it must be Apache.

Is there some ini setting that I can change in my script to force more to run at once please?

I noticed this comment in another posting at Simultaneous Requests to PHP Script:

"If the requests come from the same client AND the same browser most browsers will queue the requests in this case, even when there is nothing server-side producing this behaviour."

I am running on Chrome.

  • Before finding the root cause: why do you want to open that many tabs? Lots of tabs can make the browser crash. You need to find an optimal solution – Hüseyin BABAL Apr 06 '14 at 10:39
  • Why don't you do it in one tab, within a loop over the arguments? – nxu Apr 06 '14 at 10:41
  • BABAL - Because I have a life expectancy of about 80 years. The fewer tabs I start the greater the likelihood that I will reach the end of my lifespan before I manage to download all of the data. :-) – user2605793 Apr 06 '14 at 10:41
  • nXu - I used to do it in one tab but it kept crashing with Server error 500. I have set max_execution_time to 0 but it still kept crashing. The only way to stop it was to reduce the amount of work each script does. – user2605793 Apr 06 '14 at 10:43
  • Run your script at the backend not in browser. When operation finished, email result to yourself or save it to db. – Hüseyin BABAL Apr 06 '14 at 10:44
  • The script IS running at the backend. It is a PHP script. It has to be because it is writing the data to an SQL database on the server. I cannot run it locally because of the logistics of then uploading all of the data from my PC to the server. – user2605793 Apr 06 '14 at 10:46
  • Wait, am I missing something here or are you trying to "parallelize" 50M requests to an external site by opening 50M browser tabs in groups of 26? If so, you're doing it terribly wrong. – lafor Apr 06 '14 at 10:59
  • A task like this should not need any involvement from Apache or a browser; run your script on the command line or via cron and you should have no problem running it as many times as you like. For a more advanced solution, look into SupervisorD or GearMan, which can manage parallel tasks like this for you. – IMSoP Apr 06 '14 at 11:09
  • Lafor, I am not trying to open 50M tabs at once. I am trying to scrape a lot of data in small batches. (The data is public information and not subject to any copyright.) If I opened thousands of sessions it would kill the server, but only opening 6 is taking too long. So I want to open 26, let them complete, then start another 26. – user2605793 Apr 09 '14 at 03:25
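The command-line approach suggested in the comments can be sketched roughly as follows. This is a minimal stand-in, not the real scraper: `run_worker` here only logs, but in an actual run it would be something like `php worker.php "$value"` (assuming a hypothetical `worker.php` that takes the search value as its first argument). Invoked from a shell or cron, no browser or per-host connection limit applies.

```shell
# Minimal sketch of running one batch of workers in parallel from the CLI.
# run_worker is a placeholder; the real call would be: php worker.php "$1"
: > scrape.log                          # start with an empty log file

run_worker() {
  echo "worker $1 done" >> scrape.log   # stand-in for the actual scrape
}

for value in A B C D; do                # a real run would loop over all 26 values
  run_worker "$value" &                 # background each worker; no 6-at-a-time cap
done
wait                                    # block until the whole batch has finished
```

Once `wait` returns, an outer loop can move on to the next batch of search values, matching the 26-at-a-time batching described above.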

2 Answers


Browsers typically limit the number of concurrent connections to a single domain. Each successive tab opened after this limit has been reached will have to wait until an earlier one has completed.

A common trick to bypass this behaviour is to spread the resources over several subdomains. Currently you're sending all your requests to website.com; change your code to send six requests each to, say, sub1.website.com, sub2.website.com, etc. You'll need to set these up in your DNS and web server, obviously. Provided your PHP script exists on each subdomain, you should be able to run more connections concurrently.
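As a sketch of the link-generation side under this scheme (sub1 through sub4 are hypothetical subdomains you would first configure in DNS and the web server, and the search values are made up):

```shell
# Rotate generated search links across four assumed subdomains so each
# subdomain gets its own per-host connection budget in the browser.
values=(AA AB AC AD AE AF AG AH AI AJ AK AL)   # example search values
: > links.txt
i=0
for value in "${values[@]}"; do
  sub=$(( i % 4 + 1 ))                         # cycle sub1..sub4
  echo "http://sub${sub}.website.com/cgi-bin/Search?search=${value}" >> links.txt
  i=$(( i + 1 ))
done
```

With four subdomains and a per-host limit of 6, up to 24 requests could be in flight at once.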


I found the answer here: Max parallel http connections in a browser?

It is a browser issue. The linked post indicates that Firefox allows the limit to be increased, so I will try that.

For the benefit of others, here is what you have to do to allow Firefox to have more than 6 sessions with the one host. It is slightly different from the above post.

1. Enter about:config
2. Accept the warranty warning
3. Find network.http.max-persistent-connections-per-server and change it from 6 to whatever value you need. 

You can now run more scripts on that host from separate tabs.

