I'm parsing a big XML file with Scrapy and, for each node I'm interested in, I yield a request and an item.
What happens now is that the XML file is processed completely first: the items I yield successfully pass through my item pipelines, and only after that is done does Scrapy start to process all the requests I yielded along with the items.
What I want is for Scrapy to execute the requests immediately when I yield them, not after having parsed the whole XML.
Scrapy doesn't seem to be doing DFO (depth-first order) here.
This answer suggests using
# change to breadth first
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'
which I did, but unfortunately it had no effect (i.e. the requests still aren't executed immediately).
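For comparison, these are (as far as I can tell) the scheduler defaults, which should already give DFO, using the same scrapy.squeue module path as above:

# Scrapy's defaults, as I understand them: LIFO queues, i.e. depth-first order
DEPTH_PRIORITY = 0
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'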
The log file looks like this (filtered for the item stats lines):
2015-04-28 12:37:15+0200 [scraper] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-04-28 12:38:15+0200 [scraper] INFO: Crawled 2 pages (at 2 pages/min), scraped 6375 items (at 6375 items/min)
2015-04-28 12:39:15+0200 [scraper] INFO: Crawled 2 pages (at 0 pages/min), scraped 11619 items (at 5244 items/min)
2015-04-28 12:40:15+0200 [scraper] INFO: Crawled 2 pages (at 0 pages/min), scraped 14302 items (at 2683 items/min)
The code looks something like this:
import tempfile
import xml.etree.ElementTree as ET

import scrapy
from scrapy import Request


class XMLSpider(scrapy.Spider):
    start_urls = [big_xml_url]

    def parse(self, response):
        # Spool the big response body to disk so iterparse can stream it
        with tempfile.TemporaryFile() as tmpfile:
            tmpfile.write(response.body)
            tmpfile.seek(0)
            for event, elem in ET.iterparse(tmpfile, ('end',)):
                if elem.tag == 'interesting_tag':
                    ret = self.parse_node(response, elem)
                    try:
                        for item in ret:
                            yield item
                    except TypeError:  # parse_node returned a single object
                        if ret:  # check that the item is not None
                            yield ret
                    elem.clear()  # free the memory of the processed node

    def parse_node(self, response, elem):
        # (Item is my item class; field lookups elided)
        yield Item(elem.find(), elem.find())
        yield Request(url=elem.find(urlfield), callback=self.parse_detail)

    def parse_detail(self, response):
        # Unfortunately this is only executed at the very end...
        pass
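To make the ordering visible, the two callbacks can be instrumented with log calls, e.g. (a sketch of the same methods inside XMLSpider; the messages are just illustrative):

# inside XMLSpider, same callbacks as above plus log calls
def parse_node(self, response, elem):
    self.log('node parsed, yielding one item and one request')
    yield Item(elem.find(), elem.find())
    yield Request(url=elem.find(urlfield), callback=self.parse_detail)

def parse_detail(self, response):
    # this line only appears after *all* items from the XML
    # have already passed through the pipelines
    self.log('detail page fetched: %s' % response.url)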