A client has a website, A, which has a partner program.
Partner sites B have links to A.
I need to check, at a fixed frequency (twice a day), all the web pages of all the partner sites (5000 sites) and extract all the links from B to A. Then I have to check with a regexp whether each URL is built in a certain way.
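To make the per-page check concrete, here is a minimal sketch. Python is used only for illustration, and everything named in it is a placeholder: the host `a.example.com`, the `EXPECTED` pattern, and the helper names `LinkCollector`, `links_to_a` and `check_page` are all made up, not the real site or the real URL rule. It relies on the standard library's `html.parser`, which does not raise on unclosed tags or similar sloppy markup.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Placeholders for illustration only: the real host of A and the real
# URL rule are different.
TARGET_HOST = "a.example.com"
EXPECTED = re.compile(r"^https?://a\.example\.com/\?partner=\d+$")


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def links_to_a(html, base_url):
    """Return all links found in `html` whose host is TARGET_HOST."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return [u for u in parser.links if urlparse(u).hostname == TARGET_HOST]


def check_page(html, base_url):
    """Return (url, is_well_formed) for every link from the page to A."""
    return [(u, bool(EXPECTED.match(u))) for u in links_to_a(html, base_url)]
```

Calling `check_page(html, page_url)` on a downloaded page would give one `(url, bool)` pair per link to A, which is the result I need to store for each partner page.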
I could easily do this with PHP, but there are some serious challenges that maybe a third-party solution has already faced:
- I want to make efficient use of bandwidth
- I want the task to be done as fast as possible
- The web pages to check could be amateur web pages full of errors and inconsistent HTML
- I'd like to process only web pages that have changed since the last time I checked them
- The process has to be automated (cron? or alternatives?)
- ...
- (feel free to expand this list)
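For the "only changed pages" point, what I have in mind is roughly the following: send HTTP conditional request headers (`If-None-Match`, `If-Modified-Since`) built from what the server returned last time, treat a `304 Not Modified` response as unchanged, and fall back to comparing a hash of the body for servers that ignore those headers. This is only a sketch under assumptions: the `cached` record with its `etag`, `last_modified` and `sha256` keys, and the helper names, are hypothetical, and the actual download step is left out.

```python
import hashlib


def conditional_headers(cached):
    """Build request headers from the validators saved on the previous visit.

    `cached` is a hypothetical per-URL record such as
    {"etag": ..., "last_modified": ..., "sha256": ...}.
    """
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def has_changed(status, body, cached):
    """Decide whether a page needs to be re-parsed.

    `status` is the HTTP status code and `body` the response bytes.
    """
    if status == 304:
        # The server confirmed that the cached copy is still current.
        return False
    # Fallback for servers that always answer 200: compare content hashes.
    digest = hashlib.sha256(body).hexdigest()
    return digest != cached.get("sha256")
```

With this, a page that answers `304` costs only a header exchange, and a page that answers `200` with an identical body can still skip the link extraction step.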
But I don't want to build a super-duper-mega-ultra-sophisticated-that-does-everything-and-more tool...
I'd still like to have a small, lightweight, and clever solution.
How would you solve a task like this?