
I'm parsing a big XML file with Scrapy and, for each node I am interested in, I yield a request and an item.

What happens now is that the XML file is first processed completely: the items I yield pass my item pipelines successfully, and only after that is done does Scrapy start to process all the requests I yielded along with the items.

What I want is for Scrapy to execute the requests immediately when I yield them, not after having parsed the whole XML.

Scrapy doesn't seem to be doing DFO

This answer suggests using

# change to breadth first
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

which I did, but unfortunately it had no effect (i.e. the requests aren't executed immediately either).

The log file looks like this (filtered for item stats):

2015-04-28 12:37:15+0200 [scraper] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-04-28 12:38:15+0200 [scraper] INFO: Crawled 2 pages (at 2 pages/min), scraped 6375 items (at 6375 items/min)
2015-04-28 12:39:15+0200 [scraper] INFO: Crawled 2 pages (at 0 pages/min), scraped 11619 items (at 5244 items/min)
2015-04-28 12:40:15+0200 [scraper] INFO: Crawled 2 pages (at 0 pages/min), scraped 14302 items (at 2683 items/min)

The code looks something like this:

import tempfile
import xml.etree.ElementTree as ET

import scrapy
from scrapy import Request


class XMLSpider(scrapy.Spider):
    name = 'scraper'
    start_urls = [big_xml_url]

    def parse(self, response):
        # Spool the XML body to disk and stream-parse it node by node.
        with tempfile.TemporaryFile() as tmpfile:
            tmpfile.write(response.body)
            tmpfile.seek(0)
            for event, elem in ET.iterparse(tmpfile, ('end',)):
                if elem.tag == 'interesting_tag':
                    for ret in self.parse_node(response, elem):
                        if ret is not None:  # skip empty results
                            yield ret
                    elem.clear()  # free memory for processed nodes

    def parse_node(self, response, elem):
        yield Item(elem.find(), elem.find())  # field lookups elided
        yield Request(url=elem.find(urlfield).text,
                      callback=self.parse_detail)

    def parse_detail(self, response):
        # This code is executed at the very end, unfortunately...
        pass
moritzschaefer
  • Why does it matter in which order they are executed? The process is asynchronous on purpose, but all requests will get executed when best suited. – bosnjak Apr 28 '15 at 17:30
  • It does matter because parsing the XML (and yielding the items) takes around 7 hours. These are 7 hours in which the server is not hit, and I want to hit it as evenly as possible. In fact, right now the crawl takes these 7 hours longer than it should. – moritzschaefer Apr 29 '15 at 07:09
  • 1
    Can you show your spider code? A minimal example. – bosnjak Apr 29 '15 at 10:42
  • @Lawrence any further ideas on this? – moritzschaefer Jun 02 '15 at 14:01

1 Answer


Well, why not try to return both the Item and the Request from the parse_node method? That way you know the first element is an Item or None and the second is a Request or None. If the request is not None, you yield it for further execution; the items you save in a collection and yield once the for loop ends.
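A minimal sketch of that idea against the spider from the question might look like this (Item, urlfield and the elided field lookups are placeholders carried over from the question):

def parse_node(self, response, elem):
    # Build both pieces up front; either one may legitimately be None.
    item = Item(elem.find(), elem.find())  # field lookups elided
    request = Request(url=elem.find(urlfield).text,
                      callback=self.parse_detail)
    return item, request

def parse(self, response):
    items = []
    with tempfile.TemporaryFile() as tmpfile:
        tmpfile.write(response.body)
        tmpfile.seek(0)
        for event, elem in ET.iterparse(tmpfile, ('end',)):
            if elem.tag == 'interesting_tag':
                item, request = self.parse_node(response, elem)
                if request is not None:
                    yield request       # hand requests to the scheduler right away
                if item is not None:
                    items.append(item)  # hold the items back
                elem.clear()
    # Only once the whole XML has been consumed, emit the collected items.
    for item in items:
        yield item

Whether the downloads actually interleave with the parsing still depends on Scrapy getting control between yields, but at least the requests reach the scheduler as early as possible instead of queueing behind thousands of items.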

Alternatively, you could download the URL you find manually and call parse_detail with the result (which of course won't be a Scrapy Response).
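A rough sketch of that second variant, assuming parse_detail is a generator of items; the fetch_detail helper name, the blocking urlopen call, and the manual HtmlResponse wrapper (so parse_detail can keep its signature) are my assumptions, not spelled out in the answer:

from urllib.request import urlopen  # urllib2.urlopen on the Python 2 of that era

from scrapy.http import HtmlResponse

def fetch_detail(self, url):  # hypothetical helper method on the spider
    body = urlopen(url).read()  # raw bytes, downloaded outside Scrapy
    fake_response = HtmlResponse(url=url, body=body)
    for item in self.parse_detail(fake_response):
        yield item

Instead of yielding a Request, parse() would then iterate over self.fetch_detail(elem.find(urlfield).text) and yield each item directly. Be aware that the synchronous download blocks the Twisted reactor while it runs, so the rest of the crawl makes no progress in the meantime.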

GHajba