
Our servers come under heavy load when our web pages are scraped by many clients. Sometimes our pages are scraped from many different IP addresses that do not belong to well-known spiders such as Google or Bing, so our defensive strategy based on IP addresses is not effective. We want some of our pages to be crawled by legitimate spiders at a reasonable frequency, but we want to stop anyone who could damage our servers.

Caching may be an option, but we have a huge number of URLs for SEO. For example, some URLs follow the pattern "https://www.xxxx.com/hot-goods/mobile-phone-1.html". That page shows a list of mobile phone products, and a single search word can produce thousands of such result pages, so the cache hit rate may not be very high. I wonder whether there are any other solutions to reduce the load on our servers.
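For reference, the per-IP defense mentioned above is essentially a fixed-window counter keyed on the client address, roughly like the sketch below (a simplified Flask-style illustration; the framework, thresholds, and names are placeholders rather than our real setup). It works against a single aggressive client, but does nothing once a scraper rotates across hundreds of addresses.

```python
# Simplified per-IP fixed-window throttle (illustrative placeholders only).
import time
from collections import defaultdict

from flask import Flask, abort, request

app = Flask(__name__)

WINDOW_SECONDS = 60            # length of the counting window
MAX_REQUESTS_PER_WINDOW = 120  # requests allowed per client IP per window

# client IP -> [window_start_timestamp, request_count]
_counters = defaultdict(lambda: [0.0, 0])

@app.before_request
def throttle_by_ip():
    ip = request.remote_addr or "unknown"
    now = time.time()
    window_start, count = _counters[ip]
    if now - window_start > WINDOW_SECONDS:
        _counters[ip] = [now, 1]   # start a fresh window for this IP
        return
    if count + 1 > MAX_REQUESTS_PER_WINDOW:
        abort(429)                 # reject: too many requests from this IP
    _counters[ip][1] = count + 1
```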

yifan
  • Do you want to allow scraping in the first place? The first thing I'd do is block the worst offenders by user-agent – Joni Dec 09 '18 at 05:29
  • @Joni I don't want to allow scraping by non-spider clients. A strategy based on the user agent is not a good solution, since the user agent can easily be modified by HTTP client tools. – yifan Dec 09 '18 at 05:33
  • Maybe using a captcha will help you. Refer to this for more details: https://stackoverflow.com/questions/3161548/how-do-i-prevent-site-scraping . It also mentions some captcha services you can use (no need to build one). – A_C Dec 09 '18 at 06:04

1 Answer


Apart from having a robots.txt file, which impolite crawlers would probably ignore anyway, you could provide a sitemap.xml file to list all your pages. Crawlers would go for those instead of using the search functionality of your site, which would reduce the load. This is also a way of avoiding multiple requests for the same content when the URLs only differ in a few parameters.
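For example, a minimal sitemap could be generated along these lines (the URLs and output path below are placeholders; you would normally enumerate the real listing pages from your catalogue):

```python
# Minimal sitemap.xml generator (placeholder URLs; adapt to your own pages).
from xml.etree import ElementTree as ET

def write_sitemap(urls, path="sitemap.xml"):
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "changefreq").text = "daily"
    ET.ElementTree(urlset).write(path, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    pages = [
        f"https://www.xxxx.com/hot-goods/mobile-phone-{i}.html"
        for i in range(1, 4)  # placeholder range; list your real pages here
    ]
    write_sitemap(pages)
```

Pointing well-behaved crawlers at a file like this at least keeps the polite ones away from the expensive search endpoints.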

If you can't avoid them, make their work simpler so that they are less of a nuisance.

Julien Nioche
  • We do have a sitemap.xml file, but the issue has nothing to do with robots.txt or sitemap.xml. Some malicious robots have deployed their programs on hundreds of cloud servers to scrape data at high frequency, which sometimes puts great pressure on our servers. – yifan Dec 10 '18 at 01:08