
I'm working on a presentation on web scraping and I'm trying to explain parts of robots.txt.

Given the following section of Wikipedia's robots.txt, it appears that IsraBot is allowed to scrape / while Mediapartners-Google* is not.

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

This is validated by

import urllib.robotparser

url = 'https://en.wikipedia.org'
rfp = urllib.robotparser.RobotFileParser(f'{url}/robots.txt')
rfp.read()

bots = ["*", "Mediapartners-Google*", "IsraBot", "gsa-garmin"]
for bot in bots:
    print(rfp.can_fetch(bot, f'{url}/'))
# True
# False
# True
# True

However, when I look at Garmin's robots.txt, they seem pretty open to scraping. The comments even state that the intent is for all bots to be able to fetch everything, with a few exceptions.

# Allow all agents to get all stuff
User-agent:  *
Disallow:

# ...except this stuff...

# pointless without POSTed form data:
Disallow: /products/comparison.jsp

# not for the general public:
Disallow: /dealerResource/*
Disallow: /lbt/*

User-agent: gsa-garmin
Allow: /

However, running the same code as above against Garmin's site, the parser reports that no bot is allowed to fetch anything.

import urllib.robotparser

url = 'https://www.garmin.com'
rfp = urllib.robotparser.RobotFileParser(f'{url}/robots.txt')
rfp.read()

bots = ["*", "Mediapartners-Google*", "IsraBot", "gsa-garmin"]
for bot in bots:
    print(rfp.can_fetch(bot, f'{url}/'))
# False
# False
# False
# False

I guess the main question is: what is the difference between the two lines below (for either Disallow or Allow)? I read the first as saying nothing is disallowed, while the second says '/' (i.e., everything) is disallowed.

Disallow:
Disallow: /
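
To test the two forms in isolation, here is a minimal check that feeds each variant to RobotFileParser.parse() directly (example.com is just a placeholder URL):

```python
import urllib.robotparser

# An empty Disallow value:
rp_empty = urllib.robotparser.RobotFileParser()
rp_empty.parse(['User-agent: *', 'Disallow:'])

# Disallow with a '/' value:
rp_slash = urllib.robotparser.RobotFileParser()
rp_slash.parse(['User-agent: *', 'Disallow: /'])

print(rp_empty.can_fetch('*', 'https://example.com/'))  # True
print(rp_slash.can_fetch('*', 'https://example.com/'))  # False
```

So at least in isolation the parser does treat the two lines differently, which matches my reading.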

I am also perplexed as to why rfp.can_fetch('gsa-garmin', 'https://www.garmin.com/') would return False.
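
To rule out the rules themselves, I also tried feeding the quoted Garmin lines straight to parse(), bypassing the HTTP fetch entirely (the lines below are re-typed from the excerpt above, comments omitted):

```python
import urllib.robotparser

garmin_rules = """\
User-agent: *
Disallow:

Disallow: /products/comparison.jsp
Disallow: /dealerResource/*
Disallow: /lbt/*

User-agent: gsa-garmin
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(garmin_rules.splitlines())

for bot in ['*', 'gsa-garmin']:
    print(bot, rp.can_fetch(bot, 'https://www.garmin.com/'))
```

If these come back True while read() against the live site yields False, the difference would seem to lie in the fetch itself (or in how the real file differs from this excerpt) rather than in how the rules are parsed.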

Is there a difference between these lines?

Disallow: 
Allow: /
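
Here is the analogous side-by-side sketch for those two directives, again via parse(); if my reading is right, both should permit everything:

```python
import urllib.robotparser

rp_disallow_nothing = urllib.robotparser.RobotFileParser()
rp_disallow_nothing.parse(['User-agent: *', 'Disallow:'])

rp_allow_all = urllib.robotparser.RobotFileParser()
rp_allow_all.parse(['User-agent: *', 'Allow: /'])

# Both should report the page as fetchable:
for rp in (rp_disallow_nothing, rp_allow_all):
    print(rp.can_fetch('*', 'https://example.com/some/page'))  # True
```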

Even the answers to this question suggest my understanding is correct, but the code says otherwise. The answers to this question also state that the Allow directive is a non-standard extension, yet RobotFileParser.parse() appears to support it.

In case this is a Python issue: the above was run under Python 3.7.5.
