I have a search engine in production serving around 700,000 URLs. The crawling is done with Scrapy, and all spiders run with DeltaFetch so that the daily runs only pick up new links.
The difficulty I'm facing is handling broken links.
I'm having a hard time finding a good way to periodically scan for and remove broken links. I've been considering a few solutions:
- Writing a Python script that uses requests.get to check every single URL and delete anything that returns a 404 status (a rough sketch of what I mean is below the list).
- Using a third-party tool like https://github.com/linkchecker/linkchecker, but I'm not sure it's the best fit since I only need to check a list of URLs, not crawl a whole website.
- Using a Scrapy spider to scrape this URL list and return any URLs that are erroring out (see the spider sketch at the end of the post). I'm not really confident about this one, since I know Scrapy tends to time out when scanning a lot of URLs across different domains, which is why I rely so much on DeltaFetch.
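For reference, this is roughly what I have in mind for the first option. It's only a sketch: the urls.txt input file, the worker count, and the set of statuses treated as "broken" are placeholders I'd adapt to my setup. I'd only delete URLs that come back with a definite broken status, and leave timeouts/connection errors alone so a transient outage doesn't wipe valid entries.

```python
import concurrent.futures

import requests

BROKEN_STATUSES = {404, 410}  # statuses treated as "broken"; adjust as needed


def check_url(url, timeout=10):
    """Return (url, status); status is None on a network-level failure."""
    try:
        # HEAD is cheaper than GET; some servers mishandle it, so fall back to GET.
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        if resp.status_code in (405, 501):
            resp = requests.get(url, allow_redirects=True, timeout=timeout, stream=True)
            resp.close()  # only the status code is needed, not the body
        return url, resp.status_code
    except requests.RequestException:
        return url, None


def find_broken(urls, workers=50):
    """Yield URLs whose status is in BROKEN_STATUSES (network errors are skipped)."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        for url, status in pool.map(check_url, urls):
            if status in BROKEN_STATUSES:
                yield url


if __name__ == "__main__":
    with open("urls.txt") as f:  # placeholder input: one URL per line
        urls = [line.strip() for line in f if line.strip()]
    for url in find_broken(urls):
        print(url)  # feed this list into the delete step
```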
Do you have any recommendations or best practices for solving this problem?
Thanks a lot.
Edit: I forgot to mention one detail: I'm looking to "validate" those 700k URLs, not to crawl them. Those 700k URLs are actually the result of crawling around 2,500k domains.
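To make the third option concrete, this is the kind of validation-only spider I had in mind. Again just a sketch: the settings values and the urls.txt input are assumptions, and yielding items from the errback is how I'd collect DNS failures, timeouts, and connection errors alongside HTTP error statuses.

```python
import scrapy


class LinkValidatorSpider(scrapy.Spider):
    name = "link_validator"

    custom_settings = {
        "HTTPERROR_ALLOW_ALL": True,   # let non-2xx responses reach the callback
        "RETRY_ENABLED": False,        # one attempt per URL is enough for validation
        "CONCURRENT_REQUESTS": 64,     # placeholder; tune for the 700k-URL run
        "DOWNLOAD_TIMEOUT": 15,        # placeholder; shorter than the default 180s
    }

    def start_requests(self):
        # Placeholder input: one URL per line.
        with open("urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    # HEAD avoids downloading bodies; some servers mishandle it,
                    # in which case switching to GET is the safer choice.
                    yield scrapy.Request(url, method="HEAD",
                                         callback=self.parse,
                                         errback=self.on_error)

    def parse(self, response):
        # Only report URLs that came back with an error status.
        if response.status >= 400:
            yield {"url": response.url, "status": response.status}

    def on_error(self, failure):
        # DNS failures, timeouts, connection errors, etc.
        yield {"url": failure.request.url, "status": "error",
               "reason": repr(failure.value)}
```

Run with `scrapy runspider link_validator.py -o broken.jsonl` and the output file would contain only the URLs to remove.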