How to parse iFrame content in crawl spider using Python

Question

I can successfully parse the main page of the website, but the second callback is never called, so I am not getting the iframe's data. The website is https://www.farfeshplus.com/Video.asp?ZoneID=297

The spider code is as follows:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class YuSpider(CrawlSpider):
        name = 'yu'
        allowed_domains = ['farfeshplus.com']
        start_urls = ['https://www.farfeshplus.com/Video.asp?ZoneID=297']

        rules = (
            Rule(LinkExtractor(restrict_xpaths='//td[@class="text6"]'), callback='parse_item', follow=True),

        )

        def parse_item(self, response):
            for url in response.xpath('//html'):
                yield {
                    'NAME': url.xpath('//h1/div/text()').extract(),
                    }
                frames = url.xpath('//iframe[@width="750"]/@src').extract_first()

                yield scrapy.Request(url=frames, callback=self.parse_frame)

        def parse_frame(self, response):
            for f in response.xpath('//div[@class="rmp-content"]/video'):
                yield {
                    'URL': f.xpath('//div[@class="rmp-content"]/video/@src').extract(),

                }

I tried renaming `parse_item` to `parse_start_url`, but no luck.

Does `parse_item` get called? Does the `for` loop inside get any iteration? What’s the value of `frames` in those iterations? — Gallaecio, Oct 24 '19 at 10:40
parse_item gets called, displays names. frames variable also gets its value. It just doesn't call parse_frame — Ibtsam Ch, Oct 24 '19 at 10:48
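
Following up on the checks suggested in the comments: when `parse_item` runs and `frames` has a value, two further things may be worth ruling out. A relative `src` is rejected by `scrapy.Request` because the URL has no scheme, and a request to a host outside `allowed_domains` is dropped by Scrapy's offsite filtering unless it is created with `dont_filter=True`. Below is a minimal standard-library sketch of both checks. The helper names `resolve_iframe_src` and `is_allowed` and the sample `src` values are hypothetical (they are not taken from the real page), and the domain test only approximates Scrapy's own matching.

```python
from urllib.parse import urljoin, urlparse


def resolve_iframe_src(page_url, src):
    """Return an absolute URL for an iframe src, or None when src is missing."""
    if not src:
        return None
    return urljoin(page_url, src.strip())


def is_allowed(url, allowed_domains):
    """Roughly mimic an allowed_domains check: exact host or any subdomain."""
    host = (urlparse(url).hostname or '').lower()
    return any(host == d or host.endswith('.' + d) for d in allowed_domains)


page = 'https://www.farfeshplus.com/Video.asp?ZoneID=297'
# Hypothetical src values for illustration only.
for src in ['/frame.asp?id=1', 'https://player.example.com/embed/1', None]:
    url = resolve_iframe_src(page, src)
    ok = bool(url) and is_allowed(url, ['farfeshplus.com'])
    print(src, '->', url, '(allowed)' if ok else '(filtered or missing)')
```

If the real iframe turns out to live on another host, that would explain a callback that is never reached; running the spider with debug logging should show whether the request was filtered as offsite.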

Stefano Fiorucci - anakin87 · Accepted Answer · 2019-10-24T12:29:10.900

If I understood your goal correctly, I would modify your code this way:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class YuSpider(CrawlSpider):
        name = 'yu'
        allowed_domains = ['farfeshplus.com']
        start_urls = ['https://www.farfeshplus.com/Video.asp?ZoneID=297']

        rules = (
            Rule(LinkExtractor(restrict_xpaths='//td[@class="text6"]'), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            for url in response.xpath('//html'):
                # Do not yield the name here; hand it to the next callback instead.
                name = url.xpath('//h1/div/text()').extract()
                frames = url.xpath('//iframe[@width="750"]/@src').extract_first()

                # meta travels with the request and is available as response.meta in parse_frame.
                yield scrapy.Request(url=frames, callback=self.parse_frame, meta={'NAME': name})

        def parse_frame(self, response):
            name = response.meta['NAME']
            for f in response.xpath('//div[@class="rmp-content"]/video'):
                # One item per <video> element; the relative path reads only this element's src.
                yield {
                    'URL': f.xpath('./@src').extract(),
                    'NAME': name,
                }

I think the problem is related to the two yield statements in parse_item (read this). So I use response.meta to pass the name from parse_item to parse_frame, which then yields a single item containing both the name and the URL.
