
For my Scrapy project I have been using ImagesPipeline to download images. The images are stored with filenames that correspond to the SHA1 hash of their URLs.

My question is: how can I change the filenames so that they contain the value of another Scrapy field, stored in item['image_name']?

I have looked at several previous questions, including "How can I change the scrapy download image name in pipelines?" and "Scrapy image download how to use custom filename". However, I have not been able to make any of these methods work, not even the 2017 answer, which was the closest to Scrapy 1.6 that I could find. From reading the scrapy.pipelines.images.py file, my understanding is that renaming the file comes down to overriding the file_path function, which returns 'full/%s.jpg' % (image_guid).
To do this, I presume that the relevant item field must be read and stored in the request's meta data in the get_media_requests function. I am confused, though, because I do not see how this affects the item's images field, which seems to be where the path ends up when the spider runs.
I am not sure about this process and would really appreciate some help with it.
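For reference, the default naming described above can be reproduced with the standard library alone. This is a minimal sketch that assumes the stock behaviour is "SHA1 hex digest of the URL, placed under full/"; the helper name default_image_path and the example URL are hypothetical and not part of Scrapy.

```python
import hashlib


def default_image_path(url):
    """Hypothetical helper that mimics the default ImagesPipeline naming."""
    # Hash the URL bytes with SHA1 and use the hex digest as the file name
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s.jpg' % image_guid


# Example URL is made up for illustration
path = default_image_path('https://example.com/cat.png')
print(path)
```

Because the name depends only on the URL, changing it means overriding file_path so that it builds the path from something else, such as a value passed along in the request's meta.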

My current code for pipelines.py:

class ImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        img_url = item['image_url']
        meta = {'filename': item['image_name']}
        yield Request(url=img_url, meta=meta)

    def file_path(self, request, response=None, info=None):
        image_guid = request.meta.get('filename', '')
        return 'full/%s.jpg' % (image_guid)

The 'image_name' field is populated correctly; however, in the 'images' field the 'path' is still a SHA1 hash of the URL.
------------------------------Solution----------------------------------
The solution to this problem has been found. The main problem was that I did not realize that, to override the pipeline, I have to explicitly enable the custom pipeline in the project settings. The following code fixed the problem.
pipelines.py

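from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class CustomImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # One request per image URL; pass the desired file name along in meta
        return [Request(x, meta={'filename': item['image_name']}) for x in item.get(self.images_urls_field, [])]

    def file_path(self, request, response=None, info=None):
        # Use the name passed via meta instead of the default SHA1 hash of the URL
        image_guid = request.meta.get('filename', '')
        return 'full/%s.jpg' % (image_guid)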

settings.py

ITEM_PIPELINES = {'basicimage.pipelines.CustomImagesPipeline': 1,}
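As a rough sketch of the relevant part of settings.py: the images pipeline also needs IMAGES_STORE to point at a download folder, otherwise it stays disabled. The 'images' folder name below is an assumption chosen for illustration and should be replaced with your own location.

```python
# settings.py (sketch; only the image-related settings are shown)

# Enable the custom pipeline instead of the stock ImagesPipeline
ITEM_PIPELINES = {
    'basicimage.pipelines.CustomImagesPipeline': 1,
}

# Assumed download folder; paths returned by file_path are relative to it
IMAGES_STORE = 'images'
```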

Here, basicimage is the name of my project. Following this, I was able to adapt the code slightly so that it also changes the directory name, as follows.

class CustomImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        meta = {'filename': item['image_name'], 'directoryname': item['directory']}
        for x in item.get(self.images_urls_field, []):
            # yield rather than return, so that every URL in the list is requested
            yield Request(x, meta=meta)

    def file_path(self, request, response=None, info=None):
        image_guid = request.meta.get('filename', '')
        image_direct = request.meta.get('directoryname', '')
        return '%s/%s.jpg' % (image_direct, image_guid)
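One thing to watch in the file_path above: if either meta value is missing, the empty-string defaults produce the path '/.jpg'. The following is a minimal standalone sketch of the same path logic with fallbacks, so it can be tried without Scrapy. The helper name custom_image_path, the fallback choices, and the example values are all assumptions for illustration, not part of Scrapy or of the working code above.

```python
import hashlib


def custom_image_path(meta, url):
    """Hypothetical helper mirroring the file_path logic, with fallbacks."""
    # Fall back to the SHA1 of the URL when no file name was passed in meta
    filename = meta.get('filename') or hashlib.sha1(url.encode('utf-8')).hexdigest()
    # Fall back to the stock 'full' directory when no directory was passed in meta
    directory = meta.get('directoryname') or 'full'
    return '%s/%s.jpg' % (directory, filename)


# Example values are made up for illustration
named = custom_image_path(
    {'filename': 'red_shoe', 'directoryname': 'shoes'},
    'https://example.com/a.jpg',
)
unnamed = custom_image_path({}, 'https://example.com/a.jpg')
print(named)
print(unnamed)
```

The same fallbacks could be moved into file_path itself by applying them to request.meta and request.url.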
JAIvY
  • I took a look at the ImagesPipeline and what you are doing looks correct. Did you remember to enable YOUR custom ImagesPipeline? Also it is better to rename it. Rename it to `CustomImagesPipeline` and add this line in your `settings.py` replacing the import path with the correct path: `ITEM_PIPELINES = {'myprojectname.pipeline.CustomImagesPipeline': 1}` – Luiz Rodrigues da Silva Jun 12 '19 at 12:24
  • I think I understand what you're saying: basically, have I enabled the overriding pipeline? I have changed the name of the class to `CustomImagesPipeline` and changed the path to `ITEM_PIPELINES = { ' basicimage.pipeline.CustomImagesPipeline': 1}`, where basicimage is the name of my project. However, I now get no module named `'basicimage.pipeline'` – JAIvY Jun 12 '19 at 13:10
  • I've figured out how to get the module recognized by changing the line to `ITEM_PIPELINES = {'basicimage.pipelines.CustomImagesPipeline' : 1}`. This then throws up a lot of errors, one of which is `TypeError: Request url must be str or unicode, got list:`. Changing the line of code so that `img_url = str(item['image_urls'])` only seems to create more problems, so I'm not sure why this is occurring. – JAIvY Jun 12 '19 at 13:34
  • If you check the source code of [`ImagesPipeline::get_media_requests`](https://github.com/scrapy/scrapy/blob/c72ab1d4ba5dad3c68b12c473fa55b7f1f144834/scrapy/pipelines/images.py#L159) you will see that item['image_urls'] is a list of all your URLs, not a single URL. Try doing this: `def get_media_requests(self, item, info): return [Request(x, meta={'filename': item['image_name']}) for x in item.get(self.images_urls_field, [])]` I can write an answer instead of a comment if you want. – Luiz Rodrigues da Silva Jun 12 '19 at 14:19
  • Wow, thank you! That has fixed it fully, so the names are now being populated with `item['image_name']`. My next thought is whether this could be adapted so that another item field is used as the directory in place of the full in `'full/%s.jpg'`, by expanding `get_media_requests` to also pass along, for example, a `'directoryname'`. Would this just be a case of adding more meta data and reading it in `file_path`? For sure you can write an answer; I will update my question with the working code! – JAIvY Jun 12 '19 at 14:34
  • Yes, I believe it is just a matter of adding a new field to the item and adding it to the meta, just like you did with the filename. – Luiz Rodrigues da Silva Jun 12 '19 at 14:41
  • Thanks for all your help! Have got it working fully now. Will post the working code after the question. – JAIvY Jun 12 '19 at 14:47
