14

For my Scrapy project I'm currently using the ImagesPipeline. The downloaded images are stored with a SHA1 hash of their URLs as the file names.

How can I store the files using my own custom file names instead?

What if my custom file name needs to contain another scraped field from the same item? E.g. use item['desc'] as the filename for the image downloaded from item['image_url']. If I understand correctly, that would involve somehow accessing the other item fields from the images pipeline.

Any help will be appreciated.

embert
fortuneRice

6 Answers

17

This is just an update of the answer below for Scrapy 0.24, where image_key() is deprecated:

from scrapy.http import Request
from scrapy.contrib.pipeline.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):

    # Name the full-size download
    def file_path(self, request, response=None, info=None):
        # item = request.meta['item']  # via meta you can use any item field, not just the URL
        image_guid = request.url.split('/')[-1]
        return 'full/%s' % image_guid

    # Name the thumbnail version
    def thumb_path(self, request, thumb_id, response=None, info=None):
        # use request.url here: response is None when paths are computed
        image_guid = thumb_id + request.url.split('/')[-1]
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        for image in item['images']:
            yield Request(image, meta={'item': item})  # pass the item along for file_path
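Building on the commented-out meta line above: if the filename comes from another scraped field (the question's item['desc']), it usually needs sanitizing before it is safe on disk. A minimal pure-Python sketch; the sanitizing rule here is my own choice, not anything Scrapy provides:

```python
import os
import re

def filename_from_item(desc, image_url):
    # Replace any run of characters that is not a word character,
    # dash, or dot, so the scraped text is safe as a filename.
    safe = re.sub(r'[^\w\-.]+', '_', desc).strip('_')
    # Keep the extension from the URL, defaulting to .jpg.
    ext = os.path.splitext(image_url.split('/')[-1])[1] or '.jpg'
    return 'full/%s%s' % (safe, ext)

print(filename_from_item('Red Shoes (v2)', 'http://example.com/img/123.png'))
# full/Red_Shoes_v2.png
```

Inside file_path() you would call something like this with request.meta['item']['desc'] and request.url.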
peterh
sumid
12

In Scrapy 0.12 I solved it with something like this:

class MyImagesPipeline(ImagesPipeline):

    #Name download version
    def image_key(self, url):
        image_guid = url.split('/')[-1]
        return 'full/%s.jpg' % (image_guid)

    #Name thumbnail version
    def thumb_key(self, url, thumb_id):
        image_guid = thumb_id + url.split('/')[-1]
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        for image_url in item['images']:  # item['images'] is a list, so yield one request per URL
            yield Request(image_url)
Matt Luongo
Ivan Saltikov
    A small note: `ImagesPipeline.image_key(url)` and `file_key(url)` methods are deprecated, please use `file_path(request, response=None, info=None)` instead. 'scrapy/contrib/pipeline/images.py' – sumid Mar 07 '14 at 19:45
9

I found my way in 2017, with Scrapy 1.1.3:

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):

    def file_path(self, request, response=None, info=None):
        # read the filename back out of the request meta
        return request.meta.get('filename', '')

    def get_media_requests(self, item, info):
        img_url = item['img_url']
        meta = {'filename': item['name']}
        yield Request(url=img_url, meta=meta)

As in the code above, you can attach the name you want to the Request meta in get_media_requests(), and read it back in file_path() with request.meta.get('yourname', '').
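One caveat this approach doesn't cover: scraped names are not guaranteed unique, so two items with the same name will silently overwrite each other's files. A hypothetical guard (the hash-suffix scheme is my own, not part of Scrapy):

```python
import hashlib

def unique_filename(name, url, seen):
    """Return name, or name plus a short URL hash if that name is taken."""
    candidate = name
    if candidate in seen:
        # disambiguate with the first 8 hex chars of the URL's SHA1
        candidate = '%s_%s' % (name, hashlib.sha1(url.encode()).hexdigest()[:8])
    seen.add(candidate)
    return candidate

seen = set()
print(unique_filename('cat', 'http://example.com/1.jpg', seen))  # cat
print(unique_filename('cat', 'http://example.com/2.jpg', seen))  # cat_<8 hex chars>
```

You would keep the `seen` set on the pipeline instance and call this from get_media_requests() before putting the name into the request meta.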

Donald Duck
Tarjintor
8

This is how I solved the problem in Scrapy 0.10. Check the persist_image method of FSImagesStoreChangeableDirectory: the filename of the downloaded image is the key argument.

class FSImagesStoreChangeableDirectory(FSImagesStore):

    def persist_image(self, key, image, buf, info, append_path):
        # key is the filename; prepend the per-item directory
        absolute_path = self._get_filesystem_path(append_path + '/' + key)
        self._mkdir(os.path.dirname(absolute_path), info)
        image.save(absolute_path)

class ProjectPipeline(ImagesPipeline):

    def __init__(self):
        # skip ImagesPipeline.__init__ so we can plug in our own store
        super(ImagesPipeline, self).__init__()
        store_uri = settings.IMAGES_STORE
        if not store_uri:
            raise NotConfigured
        self.store = FSImagesStoreChangeableDirectory(store_uri)
llazzaro
  • Thanks for this. Do you have experience in using the Image expiration(http://doc.scrapy.org/topics/images.html#image-expiration) feature, and if so does this code affect it? – fortuneRice Jun 01 '11 at 20:16
  • I don't have experience with it, but I checked the Scrapy source code and expiration should continue to work. If you see that expiration isn't working, please tell me – llazzaro Jun 02 '11 at 01:38
2

I did a nasty quick hack for that. In my case, I stored the title of the image in my feeds, and I had only one image_urls entry per item, so I wrote the following script. It renames the image files in the images/full/ directory to the corresponding title from the item feed, which I had stored as JSON.

import os
import json

img_dir = os.path.join(os.getcwd(), 'images', 'full')
item_dir = os.path.join(os.getcwd(), 'data.json')

with open(item_dir, 'r') as item_json:
    items = json.load(item_json)

for item in items:
    if len(item['images']) > 0:
        # the stored path looks like 'full/<sha1>.jpg'
        cur_file = item['images'][0]['path'].split('/')[-1]
        cur_ext = os.path.splitext(cur_file)[1]
        new_title = item['title'] + cur_ext
        os.rename(os.path.join(img_dir, cur_file),
                  os.path.join(img_dir, new_title))

It's nasty and not recommended, but it is a naive alternative approach.
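One way to make a hack like this less risky is to separate planning from renaming: compute the (old, new) pairs first, inspect them, then apply. A sketch, where the 'images', 'path', and 'title' field names follow the JSON feed layout assumed above:

```python
import os

def rename_plan(items):
    """Map each item's hashed image file to a title-based name."""
    plan = []
    for item in items:
        if item.get('images'):
            # path looks like 'full/<sha1>.<ext>'
            cur_file = item['images'][0]['path'].split('/')[-1]
            ext = os.path.splitext(cur_file)[1]
            plan.append((cur_file, item['title'] + ext))
    return plan

items = [{'images': [{'path': 'full/0a1b2c.jpg'}], 'title': 'red-shoes'},
         {'images': [], 'title': 'no-image'}]
print(rename_plan(items))  # [('0a1b2c.jpg', 'red-shoes.jpg')]
```

Once the plan looks right, applying it is a single loop of os.rename calls over the pairs.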

shad0w_wa1k3r
0

I rewrote the code, changing `response.` to `request.` in the thumb_path def. Otherwise it won't work, because "response is set to None".

class MyImagesPipeline(ImagesPipeline):

    # Name the full-size download
    def file_path(self, request, response=None, info=None):
        # item = request.meta['item']  # via meta you can use any item field, not just the URL
        image_guid = request.url.split('/')[-1]
        return 'full/%s' % image_guid

    # Name the thumbnail version
    def thumb_path(self, request, thumb_id, response=None, info=None):
        image_guid = thumb_id + request.url.split('/')[-1]
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        for image in item['images']:
            yield Request(image)
ArtStack