
I'm parsing a big XML file with Scrapy and, for each node I am interested in, I yield a request and an item.

What happens now is that the XML file is first processed completely: the items I yield pass my item pipelines successfully, and only after that is done does Scrapy start to process all the requests I yielded along with the items.

What I want is for Scrapy to execute the requests immediately when I yield them, not after having parsed the whole XML.

Scrapy doesn't seem to be doing DFO

This answer suggests using

# change to breadth first
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

which I did, but unfortunately it had no effect (i.e. the requests aren't executed immediately either).

The log file looks like this (filtered for item stats):

2015-04-28 12:37:15+0200 [scraper] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-04-28 12:38:15+0200 [scraper] INFO: Crawled 2 pages (at 2 pages/min), scraped 6375 items (at 6375 items/min)
2015-04-28 12:39:15+0200 [scraper] INFO: Crawled 2 pages (at 0 pages/min), scraped 11619 items (at 5244 items/min)
2015-04-28 12:40:15+0200 [scraper] INFO: Crawled 2 pages (at 0 pages/min), scraped 14302 items (at 2683 items/min)

The code looks something like this:

import tempfile
import xml.etree.ElementTree as ET

import scrapy
from scrapy import Request


class XMLSpider(scrapy.Spider):
    name = 'scraper'
    start_urls = [big_xml_url]

    def parse(self, response):
        # Spool the XML body to disk and stream-parse it node by node.
        with tempfile.TemporaryFile() as tmpfile:
            tmpfile.write(response.body)
            tmpfile.seek(0)
            for event, elem in ET.iterparse(tmpfile, ('end',)):
                if elem.tag == 'interesting_tag':
                    for ret in self.parse_node(response, elem):
                        if ret is not None:  # skip empty results
                            yield ret
                    elem.clear()  # free memory for processed nodes

    def parse_node(self, response, elem):
        yield Item(elem.find(), elem.find())  # field lookups elided
        yield Request(url=elem.find(urlfield).text,
                      callback=self.parse_detail)

    def parse_detail(self, response):
        # This code is executed at the very end, unfortunately...
        pass
moritzschaefer
  • Why does it matter in which order they are executed? The process is asynchronous on purpose, but all requests will get executed when best suited. – bosnjak Apr 28 '15 at 17:30
  • It does matter because parsing the XML (and yielding the items) takes around 7 hours. These are 7 hours in which the server is not hit, and I want to hit it as evenly as possible. In fact, right now the crawl takes these 7 hours longer than it should. – moritzschaefer Apr 29 '15 at 07:09
  • 1
    Can you show your spider code? A minimal example. – bosnjak Apr 29 '15 at 10:42
  • @Lawrence any further ideas on this? – moritzschaefer Jun 02 '15 at 14:01

1 Answer


Well, why not try to return both the Item and the Request from the parse_node method? That way you know the first element is an Item or None and the second is a Request or None. If the request is not None, you yield it for further execution; the items you save in a collection and yield once the for loop ends.
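A minimal sketch of that idea against the spider from the question might look like this (Item, urlfield and the elided field lookups are placeholders carried over from the question):

def parse_node(self, response, elem):
    # Build both pieces up front; either one may legitimately be None.
    item = Item(elem.find(), elem.find())  # field lookups elided
    request = Request(url=elem.find(urlfield).text,
                      callback=self.parse_detail)
    return item, request

def parse(self, response):
    items = []
    with tempfile.TemporaryFile() as tmpfile:
        tmpfile.write(response.body)
        tmpfile.seek(0)
        for event, elem in ET.iterparse(tmpfile, ('end',)):
            if elem.tag == 'interesting_tag':
                item, request = self.parse_node(response, elem)
                if request is not None:
                    yield request       # hand requests to the scheduler right away
                if item is not None:
                    items.append(item)  # hold the items back
                elem.clear()
    # Only once the whole XML has been consumed, emit the collected items.
    for item in items:
        yield item

Whether the downloads actually interleave with the parsing still depends on Scrapy getting control between yields, but at least the requests reach the scheduler as early as possible instead of queueing behind thousands of items.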

Alternatively, you could download the URL you find manually and call parse_detail with the result (which of course won't be a Scrapy Response).
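A rough sketch of that second variant, assuming parse_detail is a generator of items; the fetch_detail helper name, the blocking urlopen call, and the manual HtmlResponse wrapper (so parse_detail can keep its signature) are my assumptions, not spelled out in the answer:

from urllib.request import urlopen  # urllib2.urlopen on the Python 2 of that era

from scrapy.http import HtmlResponse

def fetch_detail(self, url):  # hypothetical helper method on the spider
    body = urlopen(url).read()  # raw bytes, downloaded outside Scrapy
    fake_response = HtmlResponse(url=url, body=body)
    for item in self.parse_detail(fake_response):
        yield item

Instead of yielding a Request, parse() would then iterate over self.fetch_detail(elem.find(urlfield).text) and yield each item directly. Be aware that the synchronous download blocks the Twisted reactor while it runs, so the rest of the crawl makes no progress in the meantime.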

GHajba