I want to parse a robots.txt file in Python. I have explored robotParser and robotExclusionParser, but nothing really satisfies my criteria. I want to fetch all the disallowed and allowed URLs in a single shot rather than manually checking each URL to see whether it is allowed. Is there any library to do this?
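For reference, this is roughly the per-URL check I am trying to avoid (a minimal sketch with urllib.robotparser; example.com and the two test URLs are just placeholders):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# every URL has to be tested individually against the parsed rules
for url in ["https://example.com/", "https://example.com/private/page"]:
    print(url, rp.can_fetch("*", url))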
- Can I ask what robots.txt contains, and what you mean by parsing the text file? – George Willcox Mar 29 '17 at 06:19
- robots.txt is a standard followed by sites that provide sitemap support; a sitemap makes our content searchable. – Ritu Bhandari Mar 29 '17 at 06:19
- Example: https://fortune.com/robots.txt http://www.robotstxt.org/robotstxt.html – Ritu Bhandari Mar 29 '17 at 06:20
- Ah okay, that makes more sense now. Maybe you should link to this in your question for others who are unfamiliar with this concept. – George Willcox Mar 29 '17 at 06:26
- since robot.txt data is in ` – akash karothiya Mar 29 '17 at 06:26
- @GeorgeWillcox Sure, thanks. I was searching for a standard library to do this. – Ritu Bhandari Mar 29 '17 at 06:56
2 Answers
Why do you have to check your URLs manually? You can use urllib.robotparser in Python 3 and do something like this:
import urllib.robotparser as urobot
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://example.com"
rp = urobot.RobotFileParser()
rp.set_url(url + "/robots.txt")
rp.read()

if rp.can_fetch("*", url):
    site = urllib.request.urlopen(url)
    sauce = site.read()
    soup = BeautifulSoup(sauce, "html.parser")
    actual_url = site.geturl()          # URL actually served, after any redirects
    my_list = soup.find_all("a", href=True)
    for i in my_list:
        # rather than != "#" you can filter your list before looping over it
        if i["href"] != "#":
            newurl = urljoin(actual_url, i["href"])   # resolve relative links against the page URL
            try:
                if rp.can_fetch("*", newurl):
                    site = urllib.request.urlopen(newurl)
                    # do what you want on each authorized webpage
            except Exception:
                pass
else:
    print("cannot scrape")
You can use the curl command to read the robots.txt file into a single string, split it on newlines, and check each line for Allow and Disallow rules:
import os

result = os.popen("curl https://fortune.com/robots.txt").read()
result_data_set = {"Disallowed": [], "Allowed": []}

for line in result.split("\n"):
    if line.startswith('Allow'):        # this is an allowed path
        result_data_set["Allowed"].append(line.split(':', 1)[1].strip().split(' ')[0])     # drop trailing comments or other junk
    elif line.startswith('Disallow'):   # this is a disallowed path
        result_data_set["Disallowed"].append(line.split(':', 1)[1].strip().split(' ')[0])  # drop trailing comments or other junk

print(result_data_set)
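If you also want to know which User-agent block each rule belongs to, here is a rough sketch of the same idea in pure Python (urllib.request instead of curl; it assumes the usual robots.txt convention that consecutive User-agent lines share the rule block that follows them):

import urllib.request

# Sketch: group Allow/Disallow rules per User-agent block instead of one flat list.
robots_txt = urllib.request.urlopen("https://fortune.com/robots.txt").read().decode("utf-8")

rules = {}               # {user_agent: {"Allowed": [...], "Disallowed": [...]}}
current_agents = []      # agents the current rule block applies to
expecting_agents = False # True while still collecting User-agent lines for a new block

for line in robots_txt.splitlines():
    line = line.split('#', 1)[0].strip()      # drop comments and surrounding whitespace
    if ':' not in line:
        continue
    field, value = [part.strip() for part in line.split(':', 1)]
    field = field.lower()
    if field == "user-agent":
        if not expecting_agents:              # a rule block just ended, start a new group
            current_agents = []
        expecting_agents = True
        current_agents.append(value)
        rules.setdefault(value, {"Allowed": [], "Disallowed": []})
    elif field in ("allow", "disallow"):
        expecting_agents = False
        key = "Allowed" if field == "allow" else "Disallowed"
        for agent in current_agents:
            rules[agent][key].append(value)
    # other fields (Sitemap, Crawl-delay, ...) are simply ignored here

print(rules)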

Yaman Jain
- You are welcome. Nope, @Ritu, I couldn't find one that suffices your use case. Maybe you can extend this and build a library. – Yaman Jain Mar 29 '17 at 06:56