I want to parse a robots.txt file in Python. I have explored robotParser and robotExclusionParser, but nothing really satisfies my criteria. I want to fetch all the disallowed and allowed URLs in a single shot rather than manually checking each URL to see whether it is allowed. Is there any library to do this?
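For reference, this is roughly the per-URL check I am trying to avoid (a minimal sketch with urllib.robotparser; example.com and the two test URLs are just placeholders):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# every URL has to be tested individually against the parsed rules
for url in ["https://example.com/", "https://example.com/private/page"]:
    print(url, rp.can_fetch("*", url))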
- Can I ask what robots.txt contains, and what you mean by parsing the text file? – George Willcox Mar 29 '17 at 06:19
- robots.txt is a standard followed by sites that provide sitemap support; a sitemap makes our content searchable. – Ritu Bhandari Mar 29 '17 at 06:19
- Example: https://fortune.com/robots.txt http://www.robotstxt.org/robotstxt.html – Ritu Bhandari Mar 29 '17 at 06:20
- Ah okay, that makes more sense now. Maybe you should link to this in your question for others who are unfamiliar with this concept. – George Willcox Mar 29 '17 at 06:26
- since robot.txt data is in ` – akash karothiya Mar 29 '17 at 06:26
- @GeorgeWillcox Sure, thanks. I was searching for a standard library to do this. – Ritu Bhandari Mar 29 '17 at 06:56
2 Answers
Why do you have to check your URLs manually? You can use urllib.robotparser in Python 3 and do something like this:
import urllib.robotparser as urobot
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://example.com"
rp = urobot.RobotFileParser()
rp.set_url(url + "/robots.txt")
rp.read()

if rp.can_fetch("*", url):
    site = urllib.request.urlopen(url)
    sauce = site.read()
    soup = BeautifulSoup(sauce, "html.parser")
    actual_url = site.geturl()          # URL actually served, after any redirects
    my_list = soup.find_all("a", href=True)
    for i in my_list:
        # rather than != "#" you can filter your list before looping over it
        if i["href"] != "#":
            newurl = urljoin(actual_url, i["href"])   # resolve relative links against the page URL
            try:
                if rp.can_fetch("*", newurl):
                    site = urllib.request.urlopen(newurl)
                    # do what you want on each authorized webpage
            except Exception:
                pass
else:
    print("cannot scrape")
You can use the curl command to read the robots.txt file into a single string, split it on newlines, and check each line for Allow and Disallow rules:
import os

result = os.popen("curl https://fortune.com/robots.txt").read()
result_data_set = {"Disallowed": [], "Allowed": []}

for line in result.split("\n"):
    if line.startswith('Allow'):        # this is an allowed path
        result_data_set["Allowed"].append(line.split(':', 1)[1].strip().split(' ')[0])     # drop trailing comments or other junk
    elif line.startswith('Disallow'):   # this is a disallowed path
        result_data_set["Disallowed"].append(line.split(':', 1)[1].strip().split(' ')[0])  # drop trailing comments or other junk

print(result_data_set)
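If you also want to know which User-agent block each rule belongs to, here is a rough sketch of the same idea in pure Python (urllib.request instead of curl; it assumes the usual robots.txt convention that consecutive User-agent lines share the rule block that follows them):

import urllib.request

# Sketch: group Allow/Disallow rules per User-agent block instead of one flat list.
robots_txt = urllib.request.urlopen("https://fortune.com/robots.txt").read().decode("utf-8")

rules = {}               # {user_agent: {"Allowed": [...], "Disallowed": [...]}}
current_agents = []      # agents the current rule block applies to
expecting_agents = False # True while still collecting User-agent lines for a new block

for line in robots_txt.splitlines():
    line = line.split('#', 1)[0].strip()      # drop comments and surrounding whitespace
    if ':' not in line:
        continue
    field, value = [part.strip() for part in line.split(':', 1)]
    field = field.lower()
    if field == "user-agent":
        if not expecting_agents:              # a rule block just ended, start a new group
            current_agents = []
        expecting_agents = True
        current_agents.append(value)
        rules.setdefault(value, {"Allowed": [], "Disallowed": []})
    elif field in ("allow", "disallow"):
        expecting_agents = False
        key = "Allowed" if field == "allow" else "Disallowed"
        for agent in current_agents:
            rules[agent][key].append(value)
    # other fields (Sitemap, Crawl-delay, ...) are simply ignored here

print(rules)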

Yaman Jain
- You are welcome. Nope, @Ritu, I couldn't find one that suffices your use case. Maybe you can extend this and build a library. – Yaman Jain Mar 29 '17 at 06:56