
I have a search engine in production serving around 700,000 URLs. The crawling is done with Scrapy, and all spiders run on a schedule and use DeltaFetch so that each daily run only picks up new links.

The difficulty I'm facing is handling broken links.

I'm having a hard time finding a good way to periodically scan for, and remove, broken links. I was thinking about a few solutions:

  • Developing a Python script that uses requests.get to check every single URL and delete anything that returns a 404 status (rough sketch below).
  • Using a third-party tool like https://github.com/linkchecker/linkchecker, but I'm not sure it's the best fit since I only need to check a list of URLs, not a whole website.
  • Using a Scrapy spider to scrape this URL list and return any URLs that are erroring out. I'm not really confident in that one, since I know Scrapy tends to time out when scanning lots of URLs across different domains; this is why I rely so much on DeltaFetch.
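
A rough sketch of what I have in mind for the first option (the urls list below is just a placeholder for wherever the 700k links actually live):

import requests

urls = ["https://example.com/a", "https://example.com/b"]  # placeholder for the real 700k list

broken = []
for url in urls:
    try:
        resp = requests.get(url, timeout=10, allow_redirects=True)
        if resp.status_code == 404:
            broken.append(url)
    except requests.RequestException:
        # Timeouts / connection errors: worth retrying before deleting anything
        pass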

Do you have any recommendations / best practice to solve this problem?

Thanks a lot.

Edit: I forgot to mention one detail: I'm looking to "validate" those 700k URLs, not to crawl them. Those 700k URLs are actually the crawling result of around 2,500k domains.

romain-lavoix
  • I've written a simple script that uses urllib and checks the HTTP status code returned. It's not only 404 though; you should probably check other status codes as well, like 503, etc. – Ismailp Oct 25 '18 at 08:01
  • I'm assuming the list can be quite large, so consider fetching URLs in parallel in multiple threads (it's mostly I/O wait, so you won't be limited by the `GIL`) or, even better, use `asyncio`, something like what's mentioned here: https://stackoverflow.com/questions/35926917/asyncio-web-scraping-101-fetching-multiple-urls-with-aiohttp – zxxc Oct 25 '18 at 08:11
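
A minimal sketch of the thread-pool variant from that comment (the worker count of 50 is an arbitrary starting point, and `urls` is a placeholder):

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch_status(url):
    # Returns (url, status); None marks a network error so it can be retried later
    try:
        return url, requests.get(url, timeout=10, allow_redirects=True).status_code
    except requests.RequestException:
        return url, None

urls = ["https://example.com/a", "https://example.com/b"]  # placeholder

# The work is almost entirely I/O wait, so threads help despite the GIL
with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(fetch_status, urls))

broken = [url for url, status in results if status == 404]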

4 Answers


You could write a small script that just checks the returned HTTP status, like so:

from urllib.request import urlopen
from urllib.error import HTTPError

for url in urls:
    try:
        # A successful response means the link is alive
        urlopen(url)
    except HTTPError as e:
        # Do something when the request fails (e.code is the HTTP status, e.g. 404)
        print(e.code)

This would be the same as your first point. You could also run it asynchronously to cut down the time it takes to get through your 700k links.
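
If you take the async route, a minimal sketch with the third-party aiohttp package could look like the following (the concurrency cap of 100 is an arbitrary choice, and `urls` is a placeholder):

import asyncio
import aiohttp

async def check(session, url, sem):
    # Returns (url, status); None marks a network error so it can be retried later
    async with sem:
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as resp:
                return url, resp.status
        except (aiohttp.ClientError, asyncio.TimeoutError):
            return url, None

async def find_broken(urls):
    sem = asyncio.Semaphore(100)  # cap the number of concurrent requests
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(check(session, url, sem) for url in urls))
    return [url for url, status in results if status == 404]

urls = ["https://example.com/a", "https://example.com/b"]  # placeholder
broken = asyncio.run(find_broken(urls))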

Ismailp

I would suggest using Scrapy, since you're already requesting each URL with this tool and thus know which URLs error out. This means you don't have to check the URLs a second time.

I'd go about it like this:

  • Save every URL that errors out in a separate list/map, together with a counter (stored between runs).
  • Every time a URL errors out, increment its counter; if it doesn't, decrement the counter.
  • After running the Scrapy script, check this list/map for URLs with a high enough counter, say more than 10 failures, and remove them, or store them in a separate list of links to check again at a later time (as a safeguard in case you removed a working URL whose server was just down for too long). A rough sketch of the bookkeeping follows below.
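
A minimal sketch of that bookkeeping, assuming you collect the failing URLs from each run into a `failed_urls` list (via an errback, a pipeline, or the crawler stats, whichever you prefer); the file name and threshold are made up:

import json
import os

COUNTER_FILE = "url_failures.json"  # hypothetical location for the persisted counters
THRESHOLD = 10                      # remove a link once it has failed this often

def update_failure_counters(failed_urls, all_urls):
    counters = {}
    if os.path.exists(COUNTER_FILE):
        with open(COUNTER_FILE) as f:
            counters = json.load(f)

    failed = set(failed_urls)
    for url in all_urls:
        if url in failed:
            counters[url] = counters.get(url, 0) + 1
        else:
            # URL worked this run: decrement, and drop it once it reaches zero
            counters[url] = max(counters.get(url, 0) - 1, 0)
            if counters[url] == 0:
                counters.pop(url)

    to_remove = [url for url, count in counters.items() if count >= THRESHOLD]

    with open(COUNTER_FILE, "w") as f:
        json.dump(counters, f)

    return to_remove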

Since your third bullet is concerned with Scrapy being shaky with URL results, the same could be said for websites in general: if a site errors out on one try, it doesn't necessarily mean the link is broken.

IAmBullsaw

If you go for writing a script of your own, check this solution.
In addition, an optimization I'd suggest is to build a hierarchy in your URL repository: if you get a 404 from a parent URL, you can avoid checking all of its child URLs.
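
A rough sketch of that idea, assuming the URLs actually nest by path (which not every site guarantees, so treat the skip as a heuristic rather than a hard rule):

from urllib.parse import urlparse

import requests

def check_with_pruning(urls):
    # Check parents (shorter paths) before their children
    ordered = sorted(urls, key=lambda u: len(urlparse(u).path))
    dead_prefixes = []
    broken = []
    for url in ordered:
        if any(url.startswith(prefix) for prefix in dead_prefixes):
            broken.append(url)  # a parent already returned 404, assume the child is gone too
            continue
        try:
            status = requests.head(url, timeout=10, allow_redirects=True).status_code
        except requests.RequestException:
            continue            # network error: don't conclude anything
        if status == 404:
            broken.append(url)
            dead_prefixes.append(url.rstrip("/") + "/")
    return broken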

GyRo
  • Thank you! However, I'm not looking to crawl; those 700k URLs are already the result of a giant crawl. The script looks really good, I'll edit it and try it on my side. – romain-lavoix Oct 25 '18 at 20:20
  1. The first thought that came to my mind is to request URLs with HEAD instead of any other method.
  2. Spawn multiple spiders at once, assigning them batches like LIMIT 0,10000 and LIMIT 10000,10000.
  3. In your data pipeline, instead of running a MySQL DELETE query each time the scraper finds a 404 status, run a DELETE FROM table WHERE link IN(link1,link2) query in bulk (see the sketch after this list).
  4. I am sure you have an INDEX on the link column; if not, add it.
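
A minimal sketch of points 1 and 3 combined; the table and column names are made up, and sqlite3 stands in for MySQL, but the bulk-delete idea is the same:

import sqlite3  # stand-in for MySQL

import requests

def find_dead_links(urls):
    dead = []
    for url in urls:
        try:
            # HEAD avoids downloading the body; some servers reject it, hence the GET fallback
            status = requests.head(url, timeout=10, allow_redirects=True).status_code
            if status in (405, 501):
                status = requests.get(url, timeout=10, allow_redirects=True).status_code
        except requests.RequestException:
            continue
        if status == 404:
            dead.append(url)
    return dead

def delete_in_bulk(conn, dead_urls, batch_size=1000):
    # One DELETE per batch instead of one per URL
    cur = conn.cursor()
    for i in range(0, len(dead_urls), batch_size):
        batch = dead_urls[i:i + batch_size]
        placeholders = ",".join("?" for _ in batch)
        cur.execute(f"DELETE FROM links WHERE url IN ({placeholders})", batch)
    conn.commit()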
Umair Ayub
  • Requesting HEAD only is a good idea! However, I don't use SQL at all, but I see your point about running the query in bulk. – romain-lavoix Oct 25 '18 at 20:18
  • @roma98 No. 2 is also a good idea; no matter what data source the scraper is reading links from, you can double the speed by spawning multiple instances. – Umair Ayub Oct 26 '18 at 06:04
  • Sometimes a HEAD request returns 404 for a live page. I think GET is more suitable. –  Oct 13 '22 at 11:49