Scrapy crawl in order

Question

I can't figure out how to make Scrapy crawl links in order. I've got a page with articles, and each one has a title, but the article doesn't match the title. In settings.py I added:

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

I've got something like this:

class Getgot(Spider):
    name = "getem"
    allowed_domains = ["somesite.us"]
    start_urls = ["file:local.html"]
    el = '//div[@article]'


    def parse(self,response):
        hxs = HtmlXPathSelector(response)
        s = hxs.select('//article')
        filename = ("links.txt")
        filly = open(filename, "w")
        for i in s:
            t = i.select('a/@href').extract()
            filly.write(str(t[0])+'\n') 
            yield Request(str(t[0]),callback=self.parse_page)


    def parse_page(self,res):
        hxs = HtmlXPathSelector(res)
        s = hxs.select('//iframe').extract()
        if s:
            filename = ("frames.txt")
            filly = open(filename, "a")
            filly.write(str(s[0])+'\n') 
        else:
            filename = ("/frames.txt")
            filly = open(filename, "a")
            filly.write('[]\n')

Please have a look at http://stackoverflow.com/questions/6566322/scrapy-crawl-urls-in-order — Pankaj Sharma, Jul 31 '14 at 12:52

score 0 · Answer 1 · answered Jul 31 '14 at 13:17

I'm not sure I understand how your question and your code are related. Where is the title?

A few tips: 1) Update your Scrapy syntax to match the latest version. 2) Don't write files from the spider; write them in an item pipeline or use a feed export. 3) If you need to pass data from one callback to the next, use the request's meta attribute.

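from scrapy import Request

# Both methods belong inside the spider class. MyItem is an Item subclass
# with 'link' and 'frame' fields, defined elsewhere in the project.
def parse(self, response):
    for link in response.xpath("//article/a/@href").extract():
        # Attach the link to the request so the callback knows where it came from.
        yield Request(link, callback=self.parse_page, meta={'link': link})

def parse_page(self, response):
    for frame in response.xpath("//iframe").extract():
        item = MyItem()
        item['link'] = response.meta['link']
        item['frame'] = frame
        yield item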

And then you export it to csv or json or whatever, to store the link and the frame together.
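
If you would rather go the pipeline route from tip 2 than use a feed export, here is a minimal sketch of what that could look like. The class name `JsonLinesPipeline` and the default file name `frames.jl` are made up for illustration; the method names follow Scrapy's item pipeline interface. Each item, with its link and frame already paired, is written as one JSON line, so the order in which responses arrive no longer matters.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: stores each item as one JSON line."""

    def __init__(self, path="frames.jl"):
        # Output path is an assumption for this sketch.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Open the output file once, when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Close the file when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item instances.
        self.file.write(json.dumps(dict(item)) + "\n")
        # Return the item so any later pipelines still receive it.
        return item
```

To enable something like this you would add the class to `ITEM_PIPELINES` in settings.py, for example `{"myproject.pipelines.JsonLinesPipeline": 300}`, where `myproject` stands in for your own project name.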


Arthur Burkhardt


The title is stored in the iframe, but I will try to update my syntax and use the meta attribute. – user3887640 Jul 31 '14 at 16:44
"if you need to transfer data from one function to the next, use the meta attribute." I think this is what the thread opener needs. It seems to me that he's trying to pass information from `parse` to `parse_page` through a file, which goes wrong because Scrapy processes requests concurrently. You might want to add a link to the official documentation, in the section called ["passing additional data to callback functions"](http://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions). – aufziehvogel Jul 31 '14 at 20:13
