
I'm trying to respect the robots.txt file while web crawling, and I encountered something weird. The robots.txt URL I'm trying to access is: https://podatki.gov.si/robots.txt

If I open this link in Chrome, I get this:

User-agent: *
Disallow: /

But if I open this link with Internet Explorer or Selenium WebDriver (ChromeDriver), I get this:

#
# robots.txt
#
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
#
# This file will be ignored unless it is at the root of your host:
# Used:    http://example.com/robots.txt
# Ignored: http://example.com/site/robots.txt
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/robotstxt.html

User-agent: *
Crawl-delay: 10
# CSS, JS, Images
Allow: /misc/*.css$
Allow: /misc/*.css?
Allow: /misc/*.js$
Allow: /misc/*.js?
Allow: /misc/*.gif
Allow: /misc/*.jpg
Allow: /misc/*.jpeg
Allow: /misc/*.png
Allow: /modules/*.css$
Allow: /modules/*.css?
Allow: /modules/*.js$
Allow: /modules/*.js?
Allow: /modules/*.gif
Allow: /modules/*.jpg
Allow: /modules/*.jpeg
Allow: /modules/*.png
Allow: /profiles/*.css$
Allow: /profiles/*.css?
Allow: /profiles/*.js$
Allow: /profiles/*.js?
Allow: /profiles/*.gif
Allow: /profiles/*.jpg
Allow: /profiles/*.jpeg
Allow: /profiles/*.png
Allow: /themes/*.css$
Allow: /themes/*.css?
Allow: /themes/*.js$
Allow: /themes/*.js?
Allow: /themes/*.gif
Allow: /themes/*.jpg
Allow: /themes/*.jpeg
Allow: /themes/*.png
# Directories
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /profiles/
Disallow: /scripts/
Disallow: /themes/
# Files
Disallow: /CHANGELOG.txt
Disallow: /cron.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /INSTALL.sqlite.txt
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /LICENSE.txt
Disallow: /MAINTAINERS.txt
Disallow: /update.php
Disallow: /UPGRADE.txt
Disallow: /xmlrpc.php
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /filter/tips/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Disallow: /user/logout/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=filter/tips/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
Disallow: /?q=user/logout/

Why does this happen? The latter seems to be a generic robots.txt file, maybe something autogenerated?

undetected Selenium
kozeljko

1 Answer


I have observed the same behavior as follows:

  • When I accessed the webpage https://podatki.gov.si/robots.txt manually, I got:

    User-agent: *
    Disallow: /
    
  • When I accessed the webpage https://podatki.gov.si/robots.txt using ChromeDriver and Chrome, I got:

    #
    # robots.txt
    #
    # This file is to prevent the crawling and indexing of certain parts
    # of your site by web crawlers and spiders run by sites like Yahoo!
    # and Google. By telling these "robots" where not to go on your site,
    # you save bandwidth and server resources.
    #
    # This file will be ignored unless it is at the root of your host:
    # Used:    http://example.com/robots.txt
    # Ignored: http://example.com/site/robots.txt
    #
    # For more information about the robots.txt standard, see:
    # http://www.robotstxt.org/robotstxt.html
    
    User-agent: *
    Crawl-delay: 10
    # CSS, JS, Images
    Allow: /misc/*.css$
    Allow: /misc/*.css?
    Allow: /misc/*.js$
    Allow: /misc/*.js?
    Allow: /misc/*.gif
    Allow: /misc/*.jpg
    Allow: /misc/*.jpeg
    Allow: /misc/*.png
    Allow: /modules/*.css$
    Allow: /modules/*.css?
    Allow: /modules/*.js$
    Allow: /modules/*.js?
    Allow: /modules/*.gif
    Allow: /modules/*.jpg
    Allow: /modules/*.jpeg
    Allow: /modules/*.png
    Allow: /profiles/*.css$
    Allow: /profiles/*.css?
    Allow: /profiles/*.js$
    Allow: /profiles/*.js?
    Allow: /profiles/*.gif
    Allow: /profiles/*.jpg
    Allow: /profiles/*.jpeg
    Allow: /profiles/*.png
    Allow: /themes/*.css$
    Allow: /themes/*.css?
    Allow: /themes/*.js$
    Allow: /themes/*.js?
    Allow: /themes/*.gif
    Allow: /themes/*.jpg
    Allow: /themes/*.jpeg
    Allow: /themes/*.png
    # Directories
    Disallow: /includes/
    Disallow: /misc/
    Disallow: /modules/
    Disallow: /profiles/
    Disallow: /scripts/
    Disallow: /themes/
    # Files
    Disallow: /CHANGELOG.txt
    Disallow: /cron.php
    Disallow: /INSTALL.mysql.txt
    Disallow: /INSTALL.pgsql.txt
    Disallow: /INSTALL.sqlite.txt
    Disallow: /install.php
    Disallow: /INSTALL.txt
    Disallow: /LICENSE.txt
    Disallow: /MAINTAINERS.txt
    Disallow: /update.php
    Disallow: /UPGRADE.txt
    Disallow: /xmlrpc.php
    # Paths (clean URLs)
    Disallow: /admin/
    Disallow: /comment/reply/
    Disallow: /filter/tips/
    Disallow: /node/add/
    Disallow: /search/
    Disallow: /user/register/
    Disallow: /user/password/
    Disallow: /user/login/
    Disallow: /user/logout/
    # Paths (no clean URLs)
    Disallow: /?q=admin/
    Disallow: /?q=comment/reply/
    Disallow: /?q=filter/tips/
    Disallow: /?q=node/add/
    Disallow: /?q=search/
    Disallow: /?q=user/password/
    Disallow: /?q=user/register/
    Disallow: /?q=user/login/
    Disallow: /?q=user/logout/
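
The practical difference between the two responses can be sketched with Python's standard-library urllib.robotparser. This is only an illustration under assumptions: the helper name summarize is hypothetical, and the rules are a short excerpt of the longer file above. The wildcard Allow lines are left out to keep the sketch simple, since the standard-library parser is not documented to support wildcard patterns.

```python
from urllib.robotparser import RobotFileParser

# Short excerpt of the longer robots.txt shown above (assumption: the
# wildcard Allow lines are omitted to keep the example simple).
ROBOTS_EXCERPT = """\
User-agent: *
Crawl-delay: 10
Disallow: /admin/
Disallow: /search/
"""


def summarize(robots_txt, user_agent, urls):
    """Return (crawl_delay, {url: allowed}) for the given rules (hypothetical helper)."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    allowed = {url: parser.can_fetch(user_agent, url) for url in urls}
    return delay, allowed


if __name__ == "__main__":
    delay, allowed = summarize(
        ROBOTS_EXCERPT,
        "*",
        ["https://podatki.gov.si/", "https://podatki.gov.si/admin/"],
    )
    print(delay, allowed)
```

With this excerpt a crawler may fetch the site root but should wait between requests and stay out of /admin/, whereas the two-line file forbids everything.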
    

robots.txt

As per robotstxt.org, website owners use the robots.txt file to give instructions about their site to web robots. This is called the Robots Exclusion Protocol.

It works as follows:

  • A robot wants to visit a website URL, e.g. http://www.example.com/welcome.html.
  • Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

    User-agent: *
    Disallow: /
    
    • The User-agent: * means this section applies to all robots.
    • The Disallow: / tells the robot that it should not visit any pages on the site.
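
The check described above can be sketched with Python's standard-library urllib.robotparser. This is a minimal sketch rather than how any particular crawler implements it: the helper name is_allowed and the user-agent name "MyCrawler" are hypothetical, and the rules are parsed from a string so that no network request is made.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt from the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines instead of fetching a URL.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "MyCrawler" is a placeholder name; the "*" section applies to it.
    print(is_allowed(ROBOTS_TXT, "MyCrawler", "http://www.example.com/welcome.html"))
    # prints False
```

A well-behaved crawler would run a check like this before every request and skip any URL for which it returns False.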

There are two important considerations when using robots.txt:

  • Robots can ignore your robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
  • The robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.

Outro

When Chrome is driven by ChromeDriver, navigator.webdriver is set. This property defines a standard way for co-operating user agents to inform the document that it is controlled by WebDriver, so that alternate code paths can be triggered during automation. That may be why you see more content in robots.txt in that case.

You can find a relevant discussion in Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection
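
To see the navigator.webdriver flag mentioned above for yourself, you can read it from the page through Selenium's execute_script. The following is a rough sketch under assumptions: the helper names is_webdriver_controlled and check_with_chrome are hypothetical, and check_with_chrome needs Selenium, Chrome, and ChromeDriver installed. It only shows how to read the flag, not how the site decides which robots.txt to serve.

```python
def is_webdriver_controlled(driver) -> bool:
    """Ask the current page whether the browser reports WebDriver control.

    `driver` is any object exposing Selenium's execute_script method
    (hypothetical helper, not part of Selenium itself).
    """
    # navigator.webdriver is true under automation; otherwise it is
    # false or undefined, which bool() maps to False.
    return bool(driver.execute_script("return navigator.webdriver"))


def check_with_chrome(url: str) -> bool:
    """Open url in a ChromeDriver-controlled Chrome and read the flag."""
    # Imported lazily so the first helper works without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return is_webdriver_controlled(driver)
    finally:
        # Always close the browser, even if the page fails to load.
        driver.quit()
```

Calling check_with_chrome("https://podatki.gov.si/robots.txt") would open the page in an automated Chrome session and report the flag as that session sees it.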

undetected Selenium
  • I'm not quite sure I understand your conclusion. The browser and selenium driver are only retrieving the robots.txt file, not doing anything else. So you can't say that the robots.txt file is ignored. They retrieve it, but it's different in some cases. That's what my problem is. – kozeljko Mar 22 '19 at 09:49
  • @kozeljko Checkout my answer update and let me know your thoughts – undetected Selenium Mar 22 '19 at 10:12