
I'm curious if anyone has recommendations on the best way to use PHP/cURL (or even another technology) to download content from a website. Right now I'm using curl_multi to do 10 requests at a time, which helps some.

I literally need to request about 100K pages daily, which gets a bit tedious (it takes 16 hours right now). My initial thought is just to set up multiple virtual machines and split up the task, but I was wondering if there's something else I'm missing besides parallelization. (I know you can always throw more machines at the problem, heh.)
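
Roughly, the curl_multi batching I'm doing looks like this (the URL list, batch size, and options are simplified placeholders):

    <?php
    // Placeholder list; the real ~100K URLs come from elsewhere.
    $urls = ['http://example.com/page1', 'http://example.com/page2' /* ... */];

    $mh = curl_multi_init();
    $handles = [];

    // Add a batch of 10 easy handles to the multi handle.
    foreach (array_slice($urls, 0, 10) as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all handles in the batch until they finish.
    $running = null;
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh); // wait for activity instead of busy-looping
    } while ($running > 0);

    // Collect the responses and clean up.
    foreach ($handles as $url => $ch) {
        $content = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
        // ... store $content ...
    }
    curl_multi_close($mh);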

Thanks in advance!

Geesu

2 Answers


It depends on what you're doing with the content, but try a queuing system.

I suggest Resque. It uses Redis to handle queues, and it's designed for speed and for running many jobs at the same time. It also ships with resque-web, which provides a nice web UI for monitoring the queues.

You could use one machine to queue up new URLs and then have one or more machines working through the queue.

Other options: Kestrel, RabbitMQ, Beanstalkd
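
Resque's own job classes are more setup than fits here, but the producer/consumer split I'm describing can be sketched with a bare Redis list through the phpredis extension (host, queue name, and the URL source are placeholders):

    <?php
    // producer.php – one machine pushes the day's URLs onto a Redis list.
    $redis = new Redis();                       // phpredis extension, Redis on localhost assumed
    $redis->connect('127.0.0.1', 6379);
    foreach ($urlsToFetch as $url) {            // $urlsToFetch: wherever your 100K URLs come from
        $redis->rPush('queue:urls', $url);
    }

    <?php
    // worker.php – run as many copies of this, on as many machines, as you need.
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);
    while (true) {
        $item = $redis->blPop(['queue:urls'], 5);   // block up to 5 s waiting for work
        if (empty($item)) {
            continue;                               // queue empty, keep waiting
        }
        $url = $item[1];                            // blPop returns [listName, value]

        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $content = curl_exec($ch);
        curl_close($ch);
        // ... parse/store $content ...
    }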

iDev247

To retrieve Web content you can use curl or fsockopen. A comparison of the two methods can be seen in Which is better approach between fsockopen and curl?.
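
For illustration, the same GET request written both ways (host and path are placeholders):

    <?php
    // 1) curl: the extension handles the HTTP protocol for you.
    $ch = curl_init('http://example.com/page');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    curl_close($ch);

    // 2) fsockopen: you speak HTTP yourself over a raw socket.
    $fp = fsockopen('example.com', 80, $errno, $errstr, 30);
    if ($fp) {
        fwrite($fp, "GET /page HTTP/1.1\r\nHost: example.com\r\nConnection: Close\r\n\r\n");
        $response = '';
        while (!feof($fp)) {
            $response .= fgets($fp, 1024);
        }
        fclose($fp);
        // $response still includes the raw headers, which you'd have to strip off yourself.
    }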

Mihai8