I am getting some strange behaviour from my Scrapy CrawlSpider that I am at a loss to explain; any suggestions appreciated! It is configured to run from a script following alecxe's answer to this question: Scrapy Very Basic Example
The script for my CrawlSpider (sdcrawler.py) is below. If I call it from the command line (e.g. "python sdcrawler.py 'myEGurl.com' 'http://www.myEGurl.com/testdomain' './outputfolder/' 'testdomain/'"), the LinkExtractor follows links on the page fine and enters the parse_item callback function to process any links it finds. However, if I call the EXACT same command with os.system() from a Python script, then for some pages (not all) the CrawlSpider never follows any links or enters the parse_item callback. I can't get any output or error messages that would explain why parse_item isn't called for these pages in this case. The print statements I have added confirm that __init__ is definitely called, but then the spider closes. I don't understand why, if I paste the "python sdcrawler.py ..." command I was using with os.system() into the command line and run it, parse_item is called for the exact same arguments.
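One way to capture whatever Scrapy logs when the spider is launched this way would be to redirect the child process's output to a file (this is only a sketch; the log filename is a placeholder, not something from my actual script):

import os

# sketch only: same os.system() launch, but keep stdout/stderr in a log file
# so Scrapy's log and any traceback are visible afterwards
cmd = "python sdcrawler.py 'myEGurl.com' 'http://www.myEGurl.com/testdomain' './outputfolder/' 'testdomain/'"
os.system(cmd + ' > crawl_debug.log 2>&1')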
CrawlSpider code:
import datetime
import os
import re
import sys

# import paths here assume the pre-1.0 Scrapy API (Crawler/configure) used below
from scrapy import signals
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from twisted.internet import reactor


class SDSpider(CrawlSpider):
    name = "sdcrawler"

    # requires 'domain', 'start_page', 'folderpath' and 'sub_domain' to be passed as string arguments IN THIS PARTICULAR ORDER!!!
    def __init__(self):
        self.allowed_domains = [sys.argv[1]]
        self.start_urls = [sys.argv[2]]
        self.folder = sys.argv[3]
        try:
            os.stat(self.folder)
        except:
            os.makedirs(self.folder)
        sub_domain = sys.argv[4]
        self.rules = [Rule(LinkExtractor(allow=sub_domain), callback='parse_item', follow=True)]
        print settings['CLOSESPIDER_PAGECOUNT']
        super(SDSpider, self).__init__()

    def parse_item(self, response):
        # check for correctly formatted HTML page, ignores crap pages and PDFs
        print "entered parse_item\n"
        if re.search("<!\s*doctype\s*(.*?)>", response.body, re.IGNORECASE) or 'HTML' in response.body[0:10]:
            s = 1
        else:
            s = 0
        if response.url[-4:] == '.pdf':
            s = 0
        if s:
            filename = response.url.replace(":", "_c_").replace(".", "_o_").replace("/", "_l_") + '.htm'
            if len(filename) > 255:
                filename = filename[0:220] + '_filename_too_long_' + str(datetime.datetime.now().microsecond) + '.htm'
            wfilename = self.folder + filename
            with open(wfilename, 'wb') as f:
                f.write(response.url)
                f.write('\n')
                f.write(response.body)
                print "i'm writing a html!\n"
                print response.url + "\n"
        else:
            print "s is zero, not scraping\n"


# callback fired when the spider is closed
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # collect/log stats?
    # stop the reactor
    reactor.stop()
    print "spider closing\n"


# instantiate settings and provide a custom configuration
settings = Settings()
settings.set('DEPTH_LIMIT', 5)
settings.set('CLOSESPIDER_PAGECOUNT', 100)
settings.set('DOWNLOAD_DELAY', 3)
settings.set('USER_AGENT', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko)')
# breadth-first crawl (depth-first is default, comment the below 3 lines out to run depth-first)
settings.set('DEPTH_PRIORITY', 1)
settings.set('SCHEDULER_DISK_QUEUE', 'scrapy.squeue.PickleFifoDiskQueue')
settings.set('SCHEDULER_MEMORY_QUEUE', 'scrapy.squeue.FifoMemoryQueue')
# instantiate a crawler passing in settings
crawler = Crawler(settings)
# instantiate a spider
spider = SDSpider()
# configure signals
crawler.signals.connect(callback, signal=signals.spider_closed)
# configure and start the crawler
crawler.configure()
crawler.crawl(spider)
crawler.start()
# start the reactor (blocks execution)
reactor.run()
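A related check (sketch only; these prints are not in the script above) would be whether the relative output folder resolves to the same place under both launch methods, since a path like './outputfolder/' depends on the directory the process is started from:

import os
import sys

# sketch only: show where the spider process is running from and where the
# output folder argument actually points once made absolute
print 'cwd:', os.getcwd()
print 'output folder:', os.path.abspath(sys.argv[3])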
EDIT in response to @alecxe's comment:
I'm calling sdcrawler.py with os.system() in a function called execute_spider(). The arguments to execute_spider() are a .txt file containing a list of subdomain URLs and the overall domain URL that the spider should stay within while exploring those subdomains.
execute_spider code:
import os
import re


def execute_spider(SDfile, homepageurl):
    folderpath = SDfile.rsplit('/', 1)[0] + '/'
    outputfolder = folderpath + 'htmls/'
    try:
        os.stat(outputfolder)
    except:
        os.makedirs(outputfolder)
    SDsvisited = folderpath + 'SDsvisited.txt'
    singlepagesvisited = folderpath + 'singlepagesvisited.txt'
    # convert all_subdomains.txt to a list of strings
    with open(SDfile) as f:
        sdlist1 = f.readlines()
    # remove duplicates from all_subdomains list
    sdlist = list(set(sdlist1))
    # set overall domain for this website, don't crawl outside their site (some of subdomains.txt will be external links)
    domain = homepageurl
    clean_domain = domain.split('.', 1)[1]
    # process sdlist: only keep over-arching subdomains and strip out single pages to be processed in a different way
    #seenSDs = []
    sdlistclean = []
    singlepagelist = []
    sdlist = sorted(sdlist)
    for item in sdlist:
        if item != '' and not item.isspace():
            if '.' in item.split('/')[-1]:
                if clean_domain in item:
                    singlepagelist.append(item)
            else:
                if item in sdlistclean:
                    pass
                else:
                    if clean_domain in item:
                        sdlistclean.append(item)
    # crawl cleaned subdomains and save html pages to outputfolder
    for item in sdlistclean:
        # check that you don't have a country multisite as your subdomain
        SDchk = item.split('/')[-2]
        if SDchk.isalpha() and len(SDchk) == 2 and SDchk != 'pr' and SDchk != 'PR' and SDchk != 'hr' and SDchk != 'HR':
            subdomain = item.split('/')[-3]
        elif re.match(r'[A-Za-z]{2}-[A-Za-z]{2}', SDchk):  # SDchk == 'en-US' or SDchk == 'en-UK':
            subdomain = item.split('/')[-3]
        else:
            subdomain = item.split('/')[-2]
        cmd = 'python sdcrawler.py ' + '\'' + clean_domain + '\' ' + '\'' + item + '\' ' + '\'' + outputfolder + '\' ' + '\'' + subdomain + '/\''
        print cmd
        os.system(cmd)
I'm printing cmd before I call os.system(cmd), and if I just copy this printed output and run it in a separate terminal, the CrawlSpider executes as I would expect, visiting links and parsing them with the parse_item callback function.
The output of printing sys.argv is:
['sdcrawler.py', 'example.com', 'http://example.com/testdomain/', './outputfolder/', 'testdomain/']
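For completeness, a variation of the launch that would take shell quoting out of the picture (just a sketch, using the same variables as the loop in execute_spider above) would be subprocess.call() with an explicit argument list:

import subprocess

# sketch only: pass the arguments as a list so no shell parsing/quoting
# happens between execute_spider and sdcrawler.py
subprocess.call(['python', 'sdcrawler.py', clean_domain, item,
                 outputfolder, subdomain + '/'])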