
Trying to cut down on the number of websites lifting our data. Here's a detailed example in this Stack Overflow question:

Scrapy not following pagination properly, catches the first link in the pagination

I'm relatively new to this, but based on the information in that question, is there any way to block this particular scraper?

  • I think it's usually done by having a robots.txt file alongside your website (http://www.robotstxt.org/). It will limit large-scale data collection from your site, but if the crawler sets a time delay, it can still work around it and grab data from your site (just much more slowly). – TYZ Sep 25 '18 at 14:00
    @YilunZhang robots.txt is just a text file, it does not prevent anyone from scraping the site. Some bots (search engines) choose to honor the requests in robots.txt, that's all. –  Sep 25 '18 at 14:43
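As the comments note, robots.txt is purely advisory. For completeness, a minimal sketch of what such a file looks like (the paths here are hypothetical examples, not from the question):

```text
# robots.txt — served at https://example.com/robots.txt
# Compliant crawlers read this; scrapers are free to ignore it.
User-agent: *
Disallow: /private/
Crawl-delay: 10

User-agent: BadBot
Disallow: /
```

Note that `Crawl-delay` is not honored by all crawlers (Googlebot ignores it, for example), which is exactly the limitation the comments describe.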

1 Answer


I think the only effective way to prevent your site from being scraped is to soft-ban IPs: restrict the number of requests each IP is allowed to make in a given timeframe. robots.txt can be useful for well-behaved crawlers like Googlebot, but most scrapers never check it, and it cannot actually forbid anything; it only requests that parts of your site not be crawled.
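To make the soft-ban idea concrete, here is a minimal sliding-window rate limiter sketch in Python. The class name, limits, and window length are illustrative choices, not part of the answer; in production you would typically do this at the reverse proxy or with a shared store like Redis rather than in-process.

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most `limit` requests per IP within a sliding `window` (seconds)."""

    def __init__(self, limit=60, window=60.0):
        self.limit = limit                # max requests per window
        self.window = window              # window length in seconds
        self.hits = defaultdict(deque)    # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        # Drop timestamps that have fallen out of the sliding window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False                  # soft-ban: reject until traffic slows
        q.append(now)
        return True
```

A request handler would call `allow(client_ip)` before serving and return HTTP 429 on `False`; a polite scraper that delays its requests stays under the limit, which matches the "much slower" behavior described in the comments.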

Severin