I've created a script using Python's Scrapy module to download and rename movie images from a torrent site and store them in a folder within the Scrapy project. When I run the script as it is, it downloads the images into that folder without errors.

At the moment, the script renames those images in pipelines.py using a convenient portion of request.url.
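To illustrate the current naming, here is a minimal sketch. The URL is made up for the example and the real site's URL layout may differ; `name_from_url` is a hypothetical stand-in for the expression used in file_path().

```python
# Hypothetical image URL; the real site's URL layout may differ.
url = "https://example.com/assets/images/movies/obsession_2019/medium-cover.jpg"

def name_from_url(url):
    # Same expression as in file_path(): take the second-to-last path segment.
    return url.split('/')[-2] + ".jpg"

image_name = name_from_url(url)
print(image_name)
```

So the name depends on how the site lays out its URLs, not on the movie title scraped in parse().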

How can I rename the downloaded images in pipelines.py using the movie names held in the variable movie defined in the get_images() method?

spider contains:

from scrapy.crawler import CrawlerProcess
import scrapy, os

class yify_sp_spider(scrapy.Spider):
    name = "yify"
    start_urls = ["https://yts.am/browse-movies"]

    custom_settings = {
        'ITEM_PIPELINES': {'yify_spider.pipelines.YifySpiderPipeline': 1},
        'IMAGES_STORE': r"C:\Users\WCS\Desktop\yify_spider\yify_spider\spiders\Images",
    }

    def parse(self, response):
        for item in response.css(".browse-movie-wrap"):
            movie_name = ''.join(item.css(".browse-movie-title::text").get().split())
            img_link = item.css("img.img-responsive::attr(src)").get()
            yield scrapy.Request(img_link, callback=self.get_images,meta={'movie':movie_name})

    def get_images(self, response):
        movie = response.meta['movie']
        yield {
            "movie":movie,
            'image_urls': [response.url],
        }

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',   
    })
    c.crawl(yify_sp_spider)
    c.start()

pipelines.py contains:

from scrapy.pipelines.images import ImagesPipeline

class YifySpiderPipeline(ImagesPipeline):

    def file_path(self, request, response=None, info=None):
        image_name = request.url.split('/')[-2]+".jpg"
        return image_name

For example, one of the downloaded images should be named Obsession.jpg once the renaming is done.

rpanai
robots.txt

1 Answer

Override get_media_requests() and add the data you need to the request. Then grab that data from the request in file_path().

For example:

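from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

class YifySpiderPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Here we add the whole item, but you can add only a single field too.
        return [Request(x, meta={'item': item}) for x in item.get(self.images_urls_field, [])]

    def file_path(self, request, response=None, info=None):
        # Read back the item that get_media_requests() attached to the request.
        item = request.meta.get('item')
        movie = item['movie']
        # Construct the filename, e.g. "Obsession.jpg".
        image_name = movie + ".jpg"
        return image_name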
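The "construct the filename" step in file_path() could be sketched as below. This is only one possible approach: `movie_to_filename` is a hypothetical helper, not part of Scrapy, and it assumes that some titles may contain characters Windows does not allow in filenames (the question stores images under a `C:\` path).

```python
import re

def movie_to_filename(movie, ext=".jpg"):
    """Turn a movie title into a filename such as 'Obsession.jpg'.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    # Drop characters that Windows does not allow in filenames.
    cleaned = re.sub(r'[<>:"/\\|?*]', "", movie)
    # Remove all whitespace, mirroring what the spider's parse() already does.
    cleaned = "".join(cleaned.split())
    # Fall back to a placeholder if nothing usable is left.
    return (cleaned or "unnamed") + ext

print(movie_to_filename("Obsession"))
print(movie_to_filename("Mission: Impossible"))
```

Inside file_path() this would be used as `return movie_to_filename(item['movie'])`. Note that two movies with the same cleaned title would map to the same file, so appending the year or another field may be worth considering.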
malberts
  • I've got one little question on this, @malberts. What would that line look like if I wanted to add only a single field rather than the whole item? Thanks in advance. – robots.txt Feb 20 '19 at 15:15
  • @robots.txt Replace `'item': item`. For example `meta={'movie': item['movie']}` – malberts Feb 20 '19 at 15:28
  • I hope you will check out [this post](https://stackoverflow.com/questions/54801031/unable-to-use-proxies-one-by-one-until-there-is-a-valid-response) in case you have a solution to offer, @malberts. Thanks in advance. – robots.txt Feb 23 '19 at 08:29