
A client has a website, A, which has a partner program.
Partner sites B have links to A.
I need to check all webpages of all the partner sites (5,000 sites) at a certain frequency (twice a day) and extract all the links from B to A. Then I have to check with a regexp whether each URL is built in a certain way.
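For the extraction step itself, here is a minimal PHP sketch of the idea; the hostnames and the regexp are made-up placeholders for whatever format the partner program actually uses:

```php
<?php
// Minimal sketch: fetch one partner page, pull out every <a href>,
// and keep only links to site A that match the expected URL format.
// partner-b.example / site-a.example and the regexp are placeholders.
$html = file_get_contents('http://partner-b.example/some-page.html');

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate sloppy, invalid HTML
$doc->loadHTML($html);
libxml_clear_errors();

foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    // e.g. accept only http(s)://www.site-a.example/partner/<numeric id>
    if (preg_match('#^https?://www\.site-a\.example/partner/\d+$#', $href)) {
        echo $href, "\n";
    }
}
```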

I could easily do this with PHP, but there are some serious challenges that a third-party solution may already have addressed:

  • I want to keep bandwidth usage under control
  • I want the task to be done as fast as possible
  • The webpages to check could be amateur pages full of errors and inconsistent HTML
  • I'd like to process only webpages that have changed since the last time I checked them (see the sketch after this list)
  • The process has to be automated (cron? or alternatives?)
  • ...
  • (feel free to expand this list)
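Two of these points are cheap to prototype even in plain PHP. For "only changed pages", HTTP conditional requests (If-Modified-Since / If-None-Match) let the server answer 304 with no body when nothing changed. Here is a hedged cURL sketch, where $lastCheckedTimestamp and $storedEtag are hypothetical values you would persist per URL in your own storage:

```php
<?php
// Sketch of a conditional GET: ask the server to send the page only if
// it changed since the last crawl. $lastCheckedTimestamp and $storedEtag
// are hypothetical values you'd keep per URL in a database.
$ch = curl_init('http://partner-b.example/some-page.html');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_TIMECONDITION  => CURL_TIMECOND_IFMODSINCE, // send If-Modified-Since
    CURLOPT_TIMEVALUE      => $lastCheckedTimestamp,
    CURLOPT_HTTPHEADER     => ['If-None-Match: ' . $storedEtag],
]);
$body = curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($code === 304) {
    // Unchanged since last check: nothing to parse.
} else {
    // Parse $body, then store the new Last-Modified/ETag for next time.
}
```

Note that many amateur-run servers don't send Last-Modified or ETag at all, so you'd still want a fallback such as hashing the body and comparing it against the previous hash. As for automation, a plain cron entry like `0 6,18 * * * php /path/to/crawler.php` (06:00 and 18:00, i.e. twice a day) is usually enough.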

But I don't want to build a super-duper-mega-ultra-sophisticated-that-does-everything-and-more-tool...
I'd still like to have a small, lightweight, clever solution.

How would you solve a task like this?


1 Answer

 - I want to keep bandwidth usage under control
 - I want the task to be done as fast as possible
 - The webpages to check could be amateur pages full of errors and inconsistent HTML
 - I'd like to process only webpages that have changed since the last time I checked them
 - The process has to be automated (cron? or alternatives?)
 - (feel free to expand this list)

Those are some pretty hefty requirements.

But I don't want to build a *super-duper-mega-ultra-sophisticated-that-does-everything-and-more-tool*...

Oh, well, no problem then... now that you've said that, I think we've narrowed it down to a super-duper-mega-ultra-sophisticated-that-does-everything-and-more-tool that's NOT a super-duper-mega-ultra-sophisticated-that-does-everything-and-more-tool.

Jokes aside, there aren't a whole lot of tools capable of doing what you described. However, there are some pretty robust tools out there that might provide a good framework for achieving your goals. You mentioned PHP, but I think you're going to have more success in the Java world. In particular, I would recommend that you check out Nutch.

  • It allows you to control your bandwidth usage via the configuration options (see the snippet after this list).
  • It's one of the fastest open-source crawlers (if not the fastest).
  • It's good at reading bad HTML (to the extent that it's possible).
  • Nutch is pretty good at efficiently selecting pages that need to be crawled because it implements the OPIC algorithm; however, focusing on freshness is quite challenging. You may have to write your own plugin to get more fine-grained freshness control.
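For the bandwidth point, the throttling knobs live in Nutch's conf/nutch-site.xml, which overrides the defaults from nutch-default.xml. As a hedged example (the property names come from nutch-default.xml; the values here are only illustrative):

```xml
<!-- conf/nutch-site.xml: illustrative throttling values -->
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value> <!-- seconds to wait between requests to the same host -->
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>  <!-- total number of fetcher threads -->
</property>
```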

I hope that helps :).

  • :) When I said a _super-duper-mega-ultra-sophisticated-that-does-everything-and-more-tool_ I was thinking of Nutch. OK, I understand that my requirements may need a robust tool like Nutch. I'll play with it during the weekend... thanks – nulll Feb 03 '12 at 08:30