
I've created a script using Python's scrapy module to download and rename movie images from multiple pages of a torrent site and store them in a desktop folder. As far as downloading and storing those images goes, my script works without errors. What I'm struggling with now is renaming those files on the fly. Since I didn't make use of an items.py file, and don't wish to, I hardly understand how the logic in pipelines.py should handle the renaming process.

My spider (it downloads the images flawlessly):

from scrapy.crawler import CrawlerProcess
import scrapy, os

class YifySpider(scrapy.Spider):
    name = "yify"

    allowed_domains = ["www.yify-torrent.org"]
    start_urls = ["https://www.yify-torrent.org/search/1080p/p-{}/".format(page) for page in range(1,5)]

    custom_settings = {
        'ITEM_PIPELINES': {'scrapy.pipelines.images.ImagesPipeline': 1},
        'IMAGES_STORE': r"C:\Users\WCS\Desktop\Images",
    }

    def parse(self, response):
        for link in response.css("article.img-item .poster-thumb::attr(src)").extract():
            img_link = response.urljoin(link)
            yield scrapy.Request(img_link, callback=self.get_images)

    def get_images(self, response):
        yield {
            'image_urls': [response.url],
        }

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',   
    })
    c.crawl(YifySpider)
    c.start()

pipelines.py contains (the following lines are placeholders, to show that I at least tried):

from scrapy.http import Request

class YifyPipeline(object):

    def file_path(self, request, response=None, info=None):
        image_name = request.url.split('/')[-1]
        return image_name

    def get_media_requests(self, item, info):
        yield Request(item['image_urls'][0], meta=item)

How can I rename the images through pipelines.py without using items.py?

robots.txt
1 Answer


You need to subclass the original ImagesPipeline:

from scrapy.pipelines.images import ImagesPipeline

class YifyPipeline(ImagesPipeline):

    def file_path(self, request, response=None, info=None):
        image_name = request.url.split('/')[-1]
        return image_name

And then refer to it in your settings:

custom_settings = {
    'ITEM_PIPELINES': {'my_project.pipelines.YifyPipeline': 1},
}

But be aware that simply reusing the exact original filename will cause problems when different files share the same name, unless you add a unique folder structure or an extra component to the filename. That is one reason checksum-based filenames are used by default. Refer to the original `file_path` implementation if you want to reuse some of its logic to prevent that.
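As a sketch of one way to handle that caveat: you can keep the readable basename but prefix a short hash of the full URL, so two posters that are both named `poster.jpg` on different pages get distinct filenames. The `unique_image_name` helper below is hypothetical (not part of Scrapy); inside the subclassed `file_path` you would return `unique_image_name(request.url)` instead of the bare basename.

```python
import hashlib

def unique_image_name(url):
    """Build a collision-resistant filename: a short SHA-1 prefix
    of the full image URL, plus the URL's original basename."""
    url_hash = hashlib.sha1(url.encode('utf-8')).hexdigest()[:10]
    basename = url.split('/')[-1]
    return '{}-{}'.format(url_hash, basename)

# Inside the subclassed pipeline's file_path():
#     return unique_image_name(request.url)
```

Because the hash is computed over the whole URL rather than just the basename, identical basenames from different pages no longer overwrite each other, while the saved files still stay recognizable at a glance.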

malberts
  • Finally, I got it working. Your solution is flawless, but I was getting an error only because I was running my script using `CrawlerProcess`. So, to make it work as-is, I needed to put this line `import sys; sys.path.append(r'C:\Users\WCS\Desktop\yify_spider')` at the top of my spider, which leads to the `scrapy.cfg`. However, do you know any alternative to this weird import when running the script with `CrawlerProcess`? Thanks. – robots.txt Feb 18 '19 at 20:43
  • @robots.txt I'm not sure why that is necessary in the first place. There might be something weird about your file structure or where you run your script. – malberts Feb 19 '19 at 06:56
  • You might wanna check out [this post](https://stackoverflow.com/questions/54773331/trouble-renaming-downloaded-images-in-a-customized-manner-through-pipelines) to offer any solution @malberts. Thanks in advance. – robots.txt Feb 19 '19 at 19:13