I'm interested in public-facing sites (nothing behind a login or other authentication) that have things like:
- Heavy use of internal 301 and 302 redirects
- Anti-scraping measures (but not banning crawlers via robots.txt)
- Non-semantic or invalid markup
- Content loaded via AJAX, triggered by onclick handlers or infinite scrolling
- Lots of parameters used in URLs
- Canonicalization problems (e.g. missing or conflicting canonical URLs)
- Convoluted internal link structure
- and anything else that generally makes crawling a website a headache!
I have built a crawler/spider that performs a range of analyses on a website, and I'm on the lookout for sites that will make it struggle.
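
To make the "lots of parameters" and canonicalization items above concrete, here is a minimal sketch of the kind of URL normalization a crawler might apply before deciding whether two links point at the same page. It is not how my crawler works; `normalize_url` and the `IGNORED_PARAMS` set are hypothetical names chosen for illustration, and the list of parameters to ignore is an assumption that would differ per site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical set of parameters assumed not to change page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def normalize_url(url: str) -> str:
    """Return a comparable form of url: lowercased scheme and host,
    ignored parameters removed, remaining parameters sorted, fragment dropped."""
    parts = urlsplit(url)
    # Keep blank values so "?a=" and "?a=1" are not silently merged.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",   # treat an empty path as the root
        urlencode(query),
        "",                  # fragments never reach the server
    ))


if __name__ == "__main__":
    print(normalize_url("HTTP://Example.com/shop?b=2&utm_source=x&a=1#top"))
```

Sites where this kind of normalization is not enough (for example, where parameter order or a session ID really does change the content) are exactly what I'm hoping to find.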