
I'm curious if anyone has recommendations on the best way to use PHP/cURL (or even another technology) to download content from a website. Right now I'm using curl_multi to do 10 requests at a time, which helps some.

I literally need to request about 100K pages daily, which gets a bit tedious (it takes 16 hours right now). My initial thought is just to set up multiple virtual machines and split up the task, but I was wondering if there's something else I'm missing besides parallelization. (I know you can always throw more machines at the problem, heh.)
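
Roughly, the curl_multi batching I'm doing looks like this (the URL list, batch size, and options are simplified placeholders):

    <?php
    // Placeholder list; the real ~100K URLs come from elsewhere.
    $urls = ['http://example.com/page1', 'http://example.com/page2' /* ... */];

    $mh = curl_multi_init();
    $handles = [];

    // Add a batch of 10 easy handles to the multi handle.
    foreach (array_slice($urls, 0, 10) as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all handles in the batch until they finish.
    $running = null;
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh); // wait for activity instead of busy-looping
    } while ($running > 0);

    // Collect the responses and clean up.
    foreach ($handles as $url => $ch) {
        $content = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
        // ... store $content ...
    }
    curl_multi_close($mh);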

Thanks in advance!

Geesu

2 Answers


It depends on what you're doing with the content, but try a queuing system.

I suggest Resque. It uses Redis to handle queues, and it's designed for speed and for running many jobs at the same time. It also ships with resque-web, which provides a nice web UI for monitoring the queues.

You could use one machine to queue up new URLs and then have one or more machines working through the queue.

Other options: Kestrel, RabbitMQ, Beanstalkd
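
Resque's own job classes are more setup than fits here, but the producer/consumer split I'm describing can be sketched with a bare Redis list through the phpredis extension (host, queue name, and the URL source are placeholders):

    <?php
    // producer.php – one machine pushes the day's URLs onto a Redis list.
    $redis = new Redis();                       // phpredis extension, Redis on localhost assumed
    $redis->connect('127.0.0.1', 6379);
    foreach ($urlsToFetch as $url) {            // $urlsToFetch: wherever your 100K URLs come from
        $redis->rPush('queue:urls', $url);
    }

    <?php
    // worker.php – run as many copies of this, on as many machines, as you need.
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);
    while (true) {
        $item = $redis->blPop(['queue:urls'], 5);   // block up to 5 s waiting for work
        if (empty($item)) {
            continue;                               // queue empty, keep waiting
        }
        $url = $item[1];                            // blPop returns [listName, value]

        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $content = curl_exec($ch);
        curl_close($ch);
        // ... parse/store $content ...
    }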

iDev247

To retrieve Web content you can use curl or fsockopen. A comparison of the two methods can be seen in Which is better approach between fsockopen and curl?.
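
For illustration, the same GET request written both ways (host and path are placeholders):

    <?php
    // 1) curl: the extension handles the HTTP protocol for you.
    $ch = curl_init('http://example.com/page');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    curl_close($ch);

    // 2) fsockopen: you speak HTTP yourself over a raw socket.
    $fp = fsockopen('example.com', 80, $errno, $errstr, 30);
    if ($fp) {
        fwrite($fp, "GET /page HTTP/1.1\r\nHost: example.com\r\nConnection: Close\r\n\r\n");
        $response = '';
        while (!feof($fp)) {
            $response .= fgets($fp, 1024);
        }
        fclose($fp);
        // $response still includes the raw headers, which you'd have to strip off yourself.
    }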

Mihai8