2

Context:

  • I am building a bot to read the news block on the first page of Google results. I need the results for 200 search queries (so 200 pages to read in total).

  • To avoid being blocked by Google, I must wait some time before doing the next search (from the same IP). If I wait 30 seconds between searches, reading the 200 pages will take (200 * 30 / 60) = 100 minutes, i.e. 1h40m.

  • But since the news in Google results changes very fast, I need those 200 pages to be accessed almost simultaneously, so reading all 200 pages should take only a few minutes.

  • If the work is divided between 20 proxies (IPs), it will take (200 / 20 * 30 / 60) = 5 minutes (20 proxies running simultaneously).

  • I was planning to use pthreads through the CLI.

Question / Doubt:

  1. Is it possible to run 20 threads simultaneously? Is it advisable to run only a few threads?

  2. What if I want to run 100 threads (using 100 proxies)?

  3. What other options do I have?

Edit:

I found another option: using PHP's curl_multi, or one of the many libraries written on top of curl_multi for this purpose. But I think I'll stick with pthreads.
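
For reference, a minimal sketch of what the curl_multi route might look like; the query URLs and proxy addresses below are placeholders, not real values:

$queries = ['https://www.google.com/search?q=term1', 'https://www.google.com/search?q=term2'];
$proxies = ['203.0.113.10:8080', '203.0.113.11:8080'];

$mh      = curl_multi_init();
$handles = [];
$results = [];

foreach ($queries as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_PROXY, $proxies[$i % count($proxies)]); // spread requests over the proxies
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// drive all transfers until every handle has finished
do {
    curl_multi_exec($mh, $running);
    if (curl_multi_select($mh) === -1) {
        usleep(1000); // avoid spinning when there is nothing to wait on
    }
} while ($running > 0);

foreach ($handles as $i => $ch) {
    $results[$i] = curl_multi_getcontent($ch); // raw HTML of each results page
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);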

luistar15
  • I already use a parallel script in PHP; you can read an [example](http://modulaweb.fr/blog/2012/12/multithread-en-php-une-methode-simple-et-fiable/), but it is in French. This script uses the exec function and memcached to communicate between the parent and child processes. – Benjamin Poignant Nov 14 '14 at 23:31

2 Answers

3

Is it possible to run 20 threads simultaneously?

Some hardware has more than 20 cores; in those cases, it is a no-brainer.

Where your hardware has fewer than 20 cores, it is still not a ridiculous number of threads, given that the nature of the threads means they will spend some time blocking on I/O, and a whole lot more time purposefully sleeping so that you don't anger Google.

Ordinarily, when the threading model in use is 1:1, as it is in PHP, it's a good idea to schedule about as many threads as you have cores; this is a sensible general rule.

Obviously, the software that started before you (your entire operating system) has likely already scheduled many more threads than you have cores.

The best-case scenario still says you can't execute more threads concurrently than you have cores available, which is the reason for the general rule. However, many of the operating system's threads don't actually need to run concurrently, so the authors of those services don't go by the same rules.

Similarly to those threads started by the operating system, you intend to prevent your threads from executing concurrently on purpose, so you can bend the rules too.

TL;DR yes, I think that's okay.
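
For illustration, a rough sketch of how that could be laid out with pthreads: one thread per proxy, each working through its own share of the queries and sleeping between requests. fetch_news() is a hypothetical helper that performs the request through the given proxy and parses the page; this is a sketch, not a definitive implementation:

class SearchThread extends Thread
{
    private $proxy;
    private $queries;

    public function __construct($proxy, array $queries)
    {
        $this->proxy   = $proxy;
        $this->queries = $queries;
    }

    public function run()
    {
        foreach ($this->queries as $query) {
            $result = fetch_news($query, $this->proxy); // hypothetical helper: request + parse
            // ... persist $result (file, database, shared object) ...
            sleep(30); // keep the per-IP delay so Google stays happy
        }
    }
}

$proxies = [/* 20 proxy addresses */];
$queries = [/* 200 search queries */];
$chunks  = array_chunk($queries, (int) ceil(count($queries) / count($proxies)));

$threads = [];
foreach ($proxies as $i => $proxy) {
    $threads[$i] = new SearchThread($proxy, $chunks[$i]);
    $threads[$i]->start();
}
foreach ($threads as $thread) {
    $thread->join();
}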

What if I want to run 100 threads?

Ordinarily, this might be a bit silly.

But since you plan to force threads to sleep for a long time in between requests, it might be okay here.

You shouldn't normally expect more threads to equate to more throughput. However, in this case, it means you can use more outgoing addresses more easily and sleep for less time overall.

Your operating system has hard limits on the number of threads it will allow you to create; you might well be approaching those limits on normal hardware at 100 threads.

TL;DR in this case, I think that's okay.

What other options do I have?

If it weren't for the parameters of your operation (that you need to sleep in between requests, and use either specific interfaces or proxies to route requests through multiple addresses), you could use non-blocking I/O quite easily.

Even given the parameters, you could still use non-blocking I/O, but it would make programming the task much more complex than it needs to be.

In my (possibly biased) opinion, you are better off using threads: the solution will be simpler, with less margin for error, and easier to understand when you come back to it in 6 months (when it breaks because Google changed their markup, or whatever).

Alternative to using proxies

Using proxies may prove unreliable and slow. If this is to be core functionality for some application, then consider obtaining enough IP addresses that you can route these requests yourself using specific interfaces. cURL, context options, and sockets will allow you to set the outbound address; this is likely to be much more reliable and faster.

While speed is not necessarily a concern, reliability should be. It is reasonable for a machine to be bound to 20 addresses; it is less reasonable for it to be bound to 100, but needs must.
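
As a sketch of what binding the outbound address looks like (198.51.100.7 stands in for one of the addresses assigned to the machine): with cURL you set CURLOPT_INTERFACE, and with a stream context you set the socket bindto option:

// with cURL: pick the source address per request
$ch = curl_init('https://www.google.com/search?q=example');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_INTERFACE, '198.51.100.7'); // interface name or bound IP
$html = curl_exec($ch);
curl_close($ch);

// with a stream context (file_get_contents and friends)
$context = stream_context_create([
    'socket' => ['bindto' => '198.51.100.7:0'], // port 0 = any source port
]);
$html = file_get_contents('https://www.google.com/search?q=example', false, $context);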

Joe Watkins
  • Thank you for the clarification; after reading even more, I'm comfortable sticking with threads. I hadn't thought of using bound IP addresses instead of proxies before; could I use IPv6 addresses? (They are cheap, almost free.) I intend to use AWS. I'm probably talking nonsense; I'd better investigate. – luistar15 Nov 17 '14 at 23:22
0

Why don't you just make a single loop which walks through the proxies?

This way it's just one process at a time, you can also filter out dead proxies, and you can still get the desired frequency of updates.

You could do something like this:

$proxies = array('127.0.0.1', '192.168.1.1'); // define proxies
$dead    = array(); // here you can store which proxies went dead (slow, not responding, up to you)
$works   = array('http://google.com/page1', 'http://google.com/page2'); // define what you want to do
$run     = true;
$last    = 0;
$looptime = (5 * 60); // 5 minutes update
$workid  = 0;
$proxyid = 0;

while ($run) {
    if ($workid < sizeof($works)) {
        // have something to do ...
        $work = $works[$workid];
        $workid++;
        $success = 0;
        while (($success == 0) and ($proxyid < sizeof($proxies))) {
            if (!in_array($proxyid, $dead)) {
                $proxy   = $proxies[$proxyid];
                $success = launch_the_proxy($work, $proxy);
                if ($success == 0) {
                    if (!in_array($proxyid, $dead)) {
                        $dead[] = $proxyid;
                    }
                }
            }
            $proxyid++;
        }
    } else {
        // restart the work sequence once there's no more work to do and loop time is reached
        if (($last + $looptime) < time()) {
            $last    = time();
            $workid  = 0;
            $proxyid = 0;
        }
    }
    sleep(1);
}

Please note, this is a simple example; you have to work out the details. You must also keep in mind that this one requires at least as many proxies as work items per cycle. (You can tweak this later as you wish, but that needs a more complex way to determine which proxy can be used again; see the sketch below.)
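
One possible way to do that, as a sketch only: remember when each proxy was last used and only hand it more work once the per-IP delay has passed. This reuses $proxies, $works, $dead and launch_the_proxy() from the example above.

$delay    = 30;                                 // per-proxy wait in seconds
$lastused = array_fill(0, sizeof($proxies), 0); // last request time per proxy

function pick_proxy($proxies, $dead, $lastused, $delay)
{
    foreach ($proxies as $id => $proxy) {
        if (!in_array($id, $dead) && (time() - $lastused[$id]) >= $delay) {
            return $id;     // alive and rested long enough
        }
    }
    return false;           // nothing available yet
}

$workid = 0;
while ($workid < sizeof($works)) {
    $proxyid = pick_proxy($proxies, $dead, $lastused, $delay);
    if ($proxyid === false) {
        sleep(1);           // every proxy is dead or still cooling down
        continue;
    }
    if (launch_the_proxy($works[$workid], $proxies[$proxyid])) {
        $lastused[$proxyid] = time();
        $workid++;          // done, move on to the next page
    } else {
        $dead[] = $proxyid; // mark the proxy dead and retry the same page
    }
}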

Gipsz Jakab
  • Because the pages contain news, and I need to read them almost simultaneously. – luistar15 Nov 17 '14 at 23:09
  • And you want to poll them every 30 seconds? That seems a bit extreme to me... Do your sources have an RSS feed? If yes, you could check that for new items first... – Gipsz Jakab Nov 18 '14 at 15:14
  • I want to read them (all 200 pages) every 15-20 minutes. Every request must wait around 30 seconds (from the same IP), so I will use 20 proxies to divide the 200 requests; each proxy will read 10 pages, one every 30 seconds (5 minutes). So the whole job will take around 5 minutes running 20 proxies simultaneously. – luistar15 Nov 18 '14 at 16:56
  • 200 pages need to be shared among 20 proxies; that's 10 requests per proxy. If you can retry in 30 seconds, that gives an update interval of only 300 seconds (5 minutes)! I think this is quite a reasonable update interval, and you still only need to make 200 requests within 5 minutes. If you're walking through the proxies, you can download the pages one after another and still keep the per-proxy wait time... – Gipsz Jakab Nov 19 '14 at 14:11
  • So it will look like this: page1 => proxy1, page2 => proxy2, page3 => proxy3, ... page20 => proxy20, page21 => proxy1 (note: here you have to check whether 30 seconds have passed since the last request through proxy1), page22 => proxy2 (check proxy2 ...), etc., page199 => proxy19 – Gipsz Jakab Nov 19 '14 at 14:13
  • Yes, that's my logic. – luistar15 Nov 19 '14 at 19:25
  • Just think about it: one request takes around 1~5 seconds, not more. So you can simply walk through the proxies as well as the URLs. This way you won't overload YOUR server, and won't overload THEIR server on the other side. But if you want to spend 20x more resources for the same result, that's up to you... – Gipsz Jakab Nov 20 '14 at 10:24