
I am trying to scrape a website whose robots.txt file says this (*zoeksuggestie* is Dutch for "search suggestion"):

```
User-agent: *

# Miscellaneous
Disallow: /mijn/
Disallow: /*/print/*
Disallow: /koop/zoeksuggestie/
Disallow: /huur/zoeksuggestie/
Disallow: /nieuwbouw/zoeksuggestie/
Disallow: /recreatie/zoeksuggestie/
Disallow: /europe/zoeksuggestie/
Disallow: /*/brochure/download/
Disallow: */uitgebreid-zoeken/*
Disallow: /makelaars/*/woningaanbod/*
Disallow: /zoekwidget/*
Allow: /zoekwidget/$
Disallow: /relatedobjects
Disallow: /mijn/huis/wonen/toevoegen/
Disallow: /*/woningrapport/

# Prevent bots from indexing combinations of locations
Disallow: /koop/*,*
Disallow: /huur/*,*
Disallow: /nieuwbouw/*,*
Disallow: /recreatie/*,*
Disallow: /europe/*,*
```

Does this mean I can't scrape any link matching `/koop/*,*`? What does the `*,*` mean? I really need to get data from this website for a project, but I keep getting blocked when using Scrapy/Beautiful Soup.

snakecharmerb
    You can always ignore robots.txt, although it is impolite to do so and is likely to get your scraper blocked. https://docs.scrapy.org/en/latest/topics/settings.html#robotstxt-obey – Iain Shelvington Nov 03 '19 at 00:25

1 Answer


The robots.txt file is part of the Robots Exclusion Standard: whenever a bot visits a website, it checks the robots.txt file to see which paths it must not access. Google uses this to avoid indexing, or at least publicly displaying, URLs matching the rules in robots.txt.

Complying with robots.txt is, however, not mandatory. The `*` is a wildcard, so `/koop/*,*` matches anything of the form `/koop/[anything],[anything]`. Here is a great guide to wildcards in robots.txt: https://geoffkenyon.com/how-to-use-wildcards-robots-txt/
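To make those wildcard semantics concrete, here is a minimal sketch (the `rule_matches` helper is my own illustration, not part of any library) that translates a robots.txt rule into a regular expression: rules match from the start of the path, `*` matches any run of characters, and a trailing `$` anchors the end.

```python
import re

def rule_matches(rule: str, path: str) -> bool:
    """Check whether a robots.txt Disallow/Allow rule matches a URL path.

    Rules match from the start of the path; '*' is a wildcard for any
    sequence of characters and a trailing '$' anchors the match at the end.
    """
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape regex metacharacters, then turn the robots.txt '*' into '.*'
    regex = re.escape(body).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

# /koop/*,* only blocks paths containing a comma after /koop/
print(rule_matches("/koop/*,*", "/koop/amsterdam,utrecht/"))   # True (blocked)
print(rule_matches("/koop/*,*", "/koop/example1/example2/"))   # False (allowed)
print(rule_matches("/zoekwidget/$", "/zoekwidget/"))           # True
print(rule_matches("/zoekwidget/$", "/zoekwidget/config"))     # False
```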

You mentioned Scrapy not working; that is because Scrapy obeys robots.txt by default. This can be disabled in the project settings, as answered here: getting Forbidden by robots.txt: scrapy
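Concretely, the setting in question is `ROBOTSTXT_OBEY` in your Scrapy project's `settings.py` (a real Scrapy setting, documented in the Scrapy settings reference):

```python
# settings.py of a Scrapy project
# Disabling this makes Scrapy ignore robots.txt. Note this is impolite
# and may still get your scraper blocked by other means (rate limits,
# user-agent checks, etc.).
ROBOTSTXT_OBEY = False
```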

bobveringa
  • so does this mean that a link like /koop/example1/example2/example3 would not get blocked? –  Nov 03 '19 at 00:38
  • Based on the robots.txt that would not get blocked. I did notice that this robots.txt is most likely for a Dutch house searching site, which I will not name here. Whenever the path after `koop/` is not found it replaces the `/` with `,` causing the site to be blocked. You can use a robots.txt tester like [https://technicalseo.com/tools/robots-txt/](https://technicalseo.com/tools/robots-txt/) to test if URLs will get blocked – bobveringa Nov 03 '19 at 00:54
  • thanks a lot! can you take a look at my code for the scraping and let me know what resources i could use to improve it? –  Nov 03 '19 at 01:20