
I've modified a basic web crawler to gather a list of links from a site, which is likely to run into the thousands. The problem I'm having is that the script times out once I try to run it through a browser. On top of this, it was mentioned in a previous question I asked that the script may also spawn too many processes at the same time and kill the server I run it on.

How would I go about fixing these issues? Or should I go with an open source crawler instead, and if so, which one? I can't find anything specific enough, and the phpDig site is down :/


dbomb101
  • No matter which script you are going to use, the only realistic way is to put the crawler into the background (with a cron job). – Wukerplank Apr 13 '11 at 11:48

1 Answer


Processes like this are best run as PHP CLI cron jobs.
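As a rough illustration, a CLI entry point for the crawler might look like the sketch below (the crawler.php file name and the crawl() function are placeholders, not code from your script):

```php
#!/usr/bin/env php
<?php
// crawler.php - intended to be run from the command line, not a browser request.
// The CLI SAPI has no execution time limit by default, but being explicit
// documents the intent and guards against restrictive php.ini settings.
set_time_limit(0);
ini_set('memory_limit', '256M');

// Placeholder for the real crawl logic from your existing script.
function crawl(string $startUrl): void
{
    // ... fetch pages, extract links, store results ...
    echo "Crawling {$startUrl}\n";
}

crawl('http://example.com/');
```

A crontab entry such as `0 2 * * * /usr/bin/php /var/www/crawler.php >> /var/log/crawler.log 2>&1` (path hypothetical) would then run it nightly with no browser, and no timeout, involved.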

If you need to be able to run it on demand from a web interface, then consider adding it to a queue to be run in the background using Gearman or even the Unix at command.
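A minimal sketch of the Gearman approach, assuming the Gearman PHP extension is installed and a gearmand server is running locally (the crawl_site job name is made up):

```php
<?php
// Submitted from the web page: hand the job to Gearman and return immediately,
// so the HTTP request finishes in milliseconds instead of timing out.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('crawl_site', json_encode(['url' => 'http://example.com/']));

// worker.php - a long-running CLI process (started by cron, supervisord, etc.)
// that picks jobs off the queue and actually performs the crawl.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('crawl_site', function (GearmanJob $job) {
    $params = json_decode($job->workload(), true);
    // ... run the crawl for $params['url'] ...
});
while ($worker->work());
```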

It so happens that I have written a PHP wrapper class for the Linux at job queue, which is available from my GitHub account should you choose to go down that route.
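This is not that wrapper class, but the underlying idea of handing a one-off job to at from a web request can be sketched like this (the script path is hypothetical):

```php
<?php
// Queue a one-off background run via the Unix `at` daemon.
// escapeshellarg() guards the command string; "now" runs the job as soon as
// atd picks it up, fully detached from the web request.
$command = '/usr/bin/php /var/www/crawler.php';
$cmd = sprintf('echo %s | at now 2>&1', escapeshellarg($command));
exec($cmd, $output, $exitCode);

if ($exitCode !== 0) {
    error_log('Failed to queue crawler: ' . implode("\n", $output));
}
```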

Treffynnon