
Is there any fast (maybe multi-threaded) way to crawl my site (clicking on all local links) to look for 404/500 errors (i.e. ensure a 200 response)?

I also want to be able to set it to only click into 1 of each type of link. So if I have 1000 category pages, it only clicks into one.

Is http://code.google.com/p/crawler4j/ a good option?

I'd like something that is super easy to set up, and I'd prefer PHP over Java (though if Java is significantly faster, that would be ok).

Ryan
  • This question would be more suitable on: http://webmasters.stackexchange.com – Nir Alfasi Jul 24 '12 at 21:33
  • I feel like a solution that involves examining the directory structure, without brute-forcing HTTP requests, would be optimal by far. That will only help for 404 errors, though; 500s could still remain. – Wug Jul 24 '12 at 21:34

3 Answers


You can use the old and stable Xenu tool to crawl your site.

You can configure it to use 100 threads and sort the results by status code (500 / 404 / 403 / 200).

Menashe Avramov
  • That's pretty cool, but ideally I could run the crawl automatically as part of a build process. Thanks! – Ryan Jul 24 '12 at 23:16
  • Hey Ryan, if you pay for Xenu you can get a version that has command-line parameters and run it automatically; more info at: http://home.snafu.de/tilman/xenulink.html#Future – Menashe Avramov Jul 24 '12 at 23:24
  • I haven't tried this yet, but the lead dev at my company independently recommended this too, so I'll mark yours as the answer. – Ryan Jul 26 '12 at 03:46

You could implement this pretty easily with any number of open-source Python projects:

  1. Mechanize seems pretty popular
  2. Beautiful Soup and urllib

You'd crawl the site using one of those methods and check the server response, which should be pretty straightforward.

However, if you have a sitemap (or any sort of list with all of your URLs), you could just try and open each one using cURL, or urllib, and get your response without the need to crawl.
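
For instance, a minimal Python sketch of that second approach (the urls.txt file name is just an assumption here; it would hold one URL per line, e.g. pulled from your sitemap) could look like:

```python
# Minimal sketch: read URLs from a hypothetical urls.txt and report
# any that do not come back with a 200.
import urllib.request
import urllib.error

with open("urls.txt") as f:                 # assumed file: one URL per line
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            code = resp.getcode()
    except urllib.error.HTTPError as e:
        code = e.code                       # 404, 500, etc. raise HTTPError
    except urllib.error.URLError:
        code = None                         # DNS or connection failure
    if code != 200:
        print(code, url)
```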

Julio

Define "fast"? how big is your site? cURL would be a good start: http://curl.haxx.se/docs/manual.html

Unless you have a really immense site and need to test it on a time scale of seconds, just enumerate the URLs into a list and try each one.
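
As a rough illustration of that approach (using Python's standard library rather than cURL itself, and assuming the enumerated URLs live in a urls.txt file), a small thread pool can get through a few thousand URLs quickly:

```python
# Sketch: check an enumerated list of URLs in parallel and print
# anything that is not a 200.
import concurrent.futures
import urllib.request
import urllib.error

def status(url):
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return url, resp.getcode()
    except urllib.error.HTTPError as e:
        return url, e.code                  # error status codes raise HTTPError
    except urllib.error.URLError:
        return url, None                    # DNS or connection failure

with open("urls.txt") as f:                 # assumed file: one URL per line
    urls = [line.strip() for line in f if line.strip()]

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    for url, code in pool.map(status, urls):
        if code != 200:
            print(code, url)
```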

Charlie Martin