
I am getting some strange behaviour from my Scrapy CrawlSpider which I am at a loss to explain; any suggestions appreciated! It's configured to run from a script, following alecxe's answer to this question: Scrapy Very Basic Example

The script for my CrawlSpider (sdcrawler.py) is below. If I call it from the command line (e.g. "python sdcrawler.py 'myEGurl.com' 'http://www.myEGurl.com/testdomain' './outputfolder/' 'testdomain/'"), the LinkExtractor follows links on the page fine and enters the parse_item callback function to process any links it finds. However, if I run the EXACT same command via os.system() from another Python script, then for some pages (not all) the CrawlSpider never follows any links and never enters the parse_item callback. I can't get any output or error messages that explain why parse_item isn't called for these pages in this case. The print statements I have added confirm that __init__ is definitely called, but then the spider closes. I don't understand why, if I paste the same "python sdcrawler.py ..." command I was passing to os.system() into a terminal and run it, parse_item is called for the exact same arguments.
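
For concreteness, this is the shape of the two invocations I'm comparing (same example URL and paths as above):

import os

# the exact command string that works fine when pasted into a terminal
cmd = ("python sdcrawler.py 'myEGurl.com' "
       "'http://www.myEGurl.com/testdomain' './outputfolder/' 'testdomain/'")

# the same command launched from another Python script; for some pages the
# spider starts (__init__ runs) and then closes without ever reaching parse_item
os.system(cmd)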

CrawlSpider code:

# imports (assuming the pre-1.0 Scrapy module layout, which matches the
# Crawler/configure usage further down)
import os
import re
import sys
import datetime

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import signals
from twisted.internet import reactor


class SDSpider(CrawlSpider):
    name = "sdcrawler"

    # requires 'domain', 'start_page', 'folderpath' and 'sub_domain' to be passed as string arguments IN THIS PARTICULAR ORDER!!!
    def __init__(self):
        self.allowed_domains = [sys.argv[1]]
        self.start_urls = [sys.argv[2]]
        self.folder = sys.argv[3]
        try:
            os.stat(self.folder)
        except:
            os.makedirs(self.folder)
        sub_domain = sys.argv[4]
        self.rules = [Rule(LinkExtractor(allow=sub_domain), callback='parse_item', follow=True)]
        print settings['CLOSESPIDER_PAGECOUNT']
        super(SDSpider, self).__init__()


    def parse_item(self, response):
        # check for correctly formatted HTML page, ignores crap pages and PDFs
        print "entered parse_item\n"
        if re.search("<!\s*doctype\s*(.*?)>", response.body, re.IGNORECASE) or 'HTML' in response.body[0:10]:
            s = 1
        else:
            s = 0
        if response.url[-4:] == '.pdf':
            s = 0

        if s:
            filename = response.url.replace(":","_c_").replace(".","_o_").replace("/","_l_") + '.htm'
            if len(filename) > 255:
                filename = filename[0:220] + '_filename_too_long_' + str(datetime.datetime.now().microsecond) + '.htm'
            wfilename = self.folder + filename
            with open(wfilename, 'wb') as f:
                f.write(response.url)
                f.write('\n')
                f.write(response.body)
                print "i'm writing a html!\n"
                print response.url+"\n"
        else:
            print "s is zero, not scraping\n"

# callback fired when the spider is closed
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # collect/log stats?

    # stop the reactor
    reactor.stop()
    print "spider closing\n"


# instantiate settings and provide a custom configuration
settings = Settings()

settings.set('DEPTH_LIMIT', 5)
settings.set('CLOSESPIDER_PAGECOUNT', 100)
settings.set('DOWNLOAD_DELAY', 3)
settings.set('USER_AGENT', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko)')

# breadth-first crawl (depth-first is default, comment the below 3 lines out to run depth-first)
settings.set('DEPTH_PRIORITY', 1)
settings.set('SCHEDULER_DISK_QUEUE', 'scrapy.squeue.PickleFifoDiskQueue')
settings.set('SCHEDULER_MEMORY_QUEUE', 'scrapy.squeue.FifoMemoryQueue')

# instantiate a crawler passing in settings
crawler = Crawler(settings)

# instantiate a spider
spider = SDSpider()

# configure signals
crawler.signals.connect(callback, signal=signals.spider_closed)

# configure and start the crawler
crawler.configure()
crawler.crawl(spider)
crawler.start()

# start the reactor (blocks execution)
reactor.run()

EDIT in response to @alecxe's comment:

I'm calling sdcrawler.py with os.system() from a function called execute_spider(). Its arguments are a .txt file containing a list of subdomain URLs and the overall domain URL that the spider should stay within while exploring those subdomains. A typical call is shown after the code below.

execute_spider code:

import os
import re

def execute_spider(SDfile, homepageurl):

    folderpath = SDfile.rsplit('/',1)[0] + '/' 
    outputfolder = folderpath + 'htmls/'
    try:
        os.stat(outputfolder)
    except:
        os.makedirs(outputfolder)

    SDsvisited = folderpath + 'SDsvisited.txt'
    singlepagesvisited = folderpath + 'singlepagesvisited.txt'

    # convert all_subdomains.txt to a list of strings
    with open(SDfile) as f:
        sdlist1 = f.readlines()

    # remove duplicates from all_subdomains list
    sdlist = list(set(sdlist1))

    # set overall domain for this website, don't crawl outside their site (some of subdomains.txt will be external links)
    domain = homepageurl
    clean_domain = domain.split('.',1)[1]

    # process sdlist: only keep over-arching subdomains and strip out single pages to be processed in a different way 
    #seenSDs = []
    sdlistclean = []
    singlepagelist = []
    sdlist = sorted(sdlist)

    for item in sdlist:
        if item != '' and not item.isspace():
            if '.' in item.split('/')[-1]:
                if clean_domain in item:
                    singlepagelist.append(item)
            else:
                if item in sdlistclean:
                    pass
                else:
                    if clean_domain in item:
                        sdlistclean.append(item)

    # crawl cleaned subdomains and save html pages to outputfolder
    for item in sdlistclean:

        # check that you don't have a country multisite as your subdomain
        SDchk = item.split('/')[-2]
        if SDchk.isalpha() and len(SDchk) == 2 and SDchk != 'pr' and SDchk != 'PR' and SDchk != 'hr' and SDchk != 'HR':
            subdomain =  item.split('/')[-3]

        elif re.match(r'[A-Za-z]{2}-[A-Za-z]{2}', SDchk): #SDchk == 'en-US' or SDchk == 'en-UK':
            subdomain =  item.split('/')[-3]
        else:
            subdomain = item.split('/')[-2]

        cmd = 'python sdcrawler.py ' + '\'' + clean_domain +  '\' ' + '\'' + item  + '\' ' + '\'' + outputfolder + '\' '+ '\'' + subdomain + '/\''

        print cmd
        os.system(cmd)
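
A typical call looks something like this (the paths here are placeholders rather than my real ones):

# placeholder arguments for illustration only
execute_spider('./example_site/all_subdomains.txt', 'http://www.example.com')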

I'm printing cmd before calling os.system(cmd), and if I just copy this printed output and run it in a separate terminal, the CrawlSpider executes as I would expect, visiting links and parsing them with the parse_item callback function.

The output of printing sys.argv is:

['sdcrawler.py', 'example.com', 'http://example.com/testdomain/', './outputfolder/', 'testdomain/']
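
For what it's worth, I could also launch the same cmd string via subprocess and capture the child process's output (just a sketch, not what the runs above used; they all went through os.system()):

import subprocess

# sketch only: run the cmd string built in execute_spider above, but capture
# stdout/stderr so the Scrapy log from the child process is visible here
proc = subprocess.Popen(cmd, shell=True,
                        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
output, _ = proc.communicate()
print output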
  • Can you show how do you use `os.system()` to run the script? Also, if you print out the `sys.argv` - what do you get? Thanks. – alecxe Aug 18 '15 at 15:37
  • Is it always the same domains / `os.system()` calls that are problematic, or is it happening randomly? – Rejected Aug 18 '15 at 20:46

0 Answers