I'm working on a presentation on web scraping and I'm trying to explain parts of robots.txt.
Given the following section of Wikipedia's robots.txt, it appears that IsraBot is allowed to scrape / while Mediapartners-Google* is not.
# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /
# Wikipedia work bots:
User-agent: IsraBot
Disallow:
This is validated by the following code:
import urllib.robotparser

url = 'https://en.wikipedia.org'
rfp = urllib.robotparser.RobotFileParser(f'{url}/robots.txt')
rfp.read()

bots = ["*", "Mediapartners-Google*", "IsraBot", "gsa-garmin"]
for bot in bots:
    print(rfp.can_fetch(bot, f'{url}/'))

# True
# False
# True
# True
However, when I look at Garmin's robots.txt, it looks quite open to scraping; the comments even state that the intent is for all bots to be able to scrape, with a few exceptions.
# Allow all agents to get all stuff
User-agent: *
Disallow:
# ...except this stuff...
# pointless without POSTed form data:
Disallow: /products/comparison.jsp
# not for the general public:
Disallow: /dealerResource/*
Disallow: /lbt/*
User-agent: gsa-garmin
Allow: /
However, when I run the same code against Garmin's site, it appears that no bot is allowed to fetch anything:
url = 'https://www.garmin.com'
rfp = urllib.robotparser.RobotFileParser(f'{url}/robots.txt')
rfp.read()

bots = ["*", "Mediapartners-Google*", "IsraBot", "gsa-garmin"]
for bot in bots:
    print(rfp.can_fetch(bot, f'{url}/'))

# False
# False
# False
# False
My main question is: what is the difference between the two lines below (for either Disallow or Allow)? I read the first as saying nothing is disallowed, and the second as saying everything under '/' is disallowed.

Disallow:
Disallow: /
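One way to check my reading without any network access is to feed each variant straight to RobotFileParser.parse(), which accepts an iterable of lines (a minimal sketch; example.com is only a placeholder):

import urllib.robotparser

for rule in ("Disallow:", "Disallow: /"):
    rfp = urllib.robotparser.RobotFileParser()
    rfp.parse(["User-agent: *", rule])
    print(repr(rule), rfp.can_fetch("*", "https://example.com/page"))

# If my reading is right, this should print:
# 'Disallow:' True
# 'Disallow: /' False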
I am also perplexed as to why rfp.can_fetch('gsa-garmin', 'https://www.garmin.com/') returns False.
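To see what the parser actually extracted from Garmin's file, its internal state can be inspected. Note that disallow_all, allow_all, default_entry, and entries are internal attributes of RobotFileParser rather than documented API, so this is only a debugging sketch:

import urllib.robotparser

rfp = urllib.robotparser.RobotFileParser('https://www.garmin.com/robots.txt')
rfp.read()

# disallow_all is set when read() hits an HTTP 401/403, in which case
# can_fetch() returns False for everything, regardless of the rules
print(rfp.disallow_all, rfp.allow_all)

print(rfp.default_entry)   # rules recorded for User-agent: *
for entry in rfp.entries:  # rules recorded for named agents
    print(entry)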
Is there a difference between these lines?
Disallow:
Allow: /
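The same parse() trick as above can compare these two spellings (again only a sketch with a placeholder URL):

import urllib.robotparser

for rule in ("Disallow:", "Allow: /"):
    rfp = urllib.robotparser.RobotFileParser()
    rfp.parse(["User-agent: *", rule])
    print(repr(rule), rfp.can_fetch("*", "https://example.com/page"))

# As far as I can tell, both print True, i.e. the parser treats an
# empty Disallow the same as Allow: /.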
Even the answers to this question suggest my understanding is correct, but the code says otherwise. The answers to this question also state that the Allow directive is a non-standard extension, yet RobotFileParser.parse() appears to support it.
And in case it's a bug in Python: this was run with Python 3.7.5.