For my Scrapy project I have been using ImagesPipeline to download images. The images are stored with filenames that correspond to the SHA1 hash of their URLs.
My question is: how can I change the filenames so that they use the value of another Scrapy field, stored in item['image_name']?
I have looked at several previous questions, including:
How can I change the scrapy download image name in pipelines?
Scrapy image download how to use custom filename
However, I have not been able to make any of these methods work, including the 2017 answer, which was the closest I could find to Scrapy 1.6.
From my reading of scrapy/pipelines/images.py, renaming the file comes down to overriding the file_path method, which by default returns 'full/%s.jpg' % (image_guid).
To do this, I presume that the relevant item value must be stored in the request's meta in the get_media_requests method.
What confuses me is how this relates to the item's images field, which is where the path shows up when the spider runs.
I am not sure about this process and would really appreciate some help.
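For reference, here is a minimal sketch of how I understand the default naming, written with only the standard library so it runs without Scrapy. The function name default_image_path and the example URL are my own, and Scrapy's actual file_path may differ in details between versions.

```python
import hashlib


def default_image_path(url):
    """Approximate the default naming: SHA1 hex digest of the URL under 'full/'."""
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s.jpg' % image_guid


# Hypothetical URL, used only for illustration
print(default_image_path('https://example.com/images/cat.png'))
```

The same URL always maps to the same path, which is why the stored name says nothing about item['image_name'] unless file_path is overridden.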
My current code for pipelines.py:

    class ImagesPipeline(ImagesPipeline):

        def get_media_requests(self, item, info):
            img_url = item['image_url']
            meta = {'filename': item['image_name']}
            yield Request(url=img_url, meta=meta)

        def file_path(self, request, response=None, info=None):
            image_guid = request.meta.get('filename', '')
            return 'full/%s.jpg' % (image_guid)
The 'image_name' field is populated correctly; however, in the 'images' field the 'path' is still the SHA1 hash of the URL.
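As far as I can tell from the pipeline source, the 'path' in the 'images' field is simply whatever file_path returned for each request, collected once the downloads finish. The sketch below imitates that step in plain Python. The function collect_results, the sample results list, and the use of a plain Exception in place of the failure object are my own stand-ins, not Scrapy's code.

```python
def collect_results(item, results, result_field='images'):
    """Store the info dicts of successful downloads on the item."""
    # Each result is an (ok, info_or_failure) pair; keep only the successes
    item[result_field] = [info for ok, info in results if ok]
    return item


# Hypothetical results, shaped like (success, info) pairs
results = [
    (True, {'url': 'https://example.com/a.jpg',
            'path': 'full/my_name.jpg',
            'checksum': 'abc123'}),
    (False, Exception('download failed')),
]

item = collect_results({'image_name': 'my_name'}, results)
print(item['images'][0]['path'])  # prints full/my_name.jpg
```

So if the 'path' still shows a SHA1 hash, the default file_path is the one being called, not the overridden one.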
------------------------------Solution----------------------------------
The solution has been found. The main problem was that I did not understand that, to override the pipeline, I have to enable the custom class in ITEM_PIPELINES in settings.py. The following code fixed the problem.
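pipelines.py

    from scrapy import Request
    from scrapy.pipelines.images import ImagesPipeline


    class CustomImagesPipeline(ImagesPipeline):

        def get_media_requests(self, item, info):
            # Pass the desired filename along with each image request
            return [Request(x, meta={'filename': item['image_name']})
                    for x in item.get(self.images_urls_field, [])]

        def file_path(self, request, response=None, info=None):
            # Use the name stored in meta instead of the SHA1 hash of the URL
            image_guid = request.meta.get('filename', '')
            return 'full/%s.jpg' % (image_guid)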
settings.py

    ITEM_PIPELINES = {'basicimage.pipelines.CustomImagesPipeline': 1}
Here basicimage is the name of my project. Following this, I was able to adapt the code slightly so that it also changes the directory name, as follows.
    class CustomImagesPipeline(ImagesPipeline):

        def get_media_requests(self, item, info):
            meta = {'filename': item['image_name'], 'directoryname': item['directory']}
            for x in item.get(self.images_urls_field, []):
                # yield rather than return, so every URL in the item is requested;
                # all URLs of one item share the same meta, hence the same path
                yield Request(x, meta=meta)

        def file_path(self, request, response=None, info=None):
            image_guid = request.meta.get('filename', '')
            image_direct = request.meta.get('directoryname', '')
            return '%s/%s.jpg' % (image_direct, image_guid)
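One caveat with the code above: if 'filename' or 'directoryname' is missing from meta, the path becomes something like '/.jpg', and names containing slashes or spaces end up in the path unchanged. Below is a minimal sketch of a helper that could be called from file_path to guard against this. The name build_image_path, the fallback to the SHA1 of the URL, and the character whitelist are my own assumptions, not part of Scrapy.

```python
import hashlib
import re


def build_image_path(url, filename='', directory=''):
    """Build '<directory>/<filename>.jpg' with simple sanitising and fallbacks."""
    # Replace anything other than word characters, dots and hyphens
    safe_name = re.sub(r'[^\w.-]+', '_', filename).strip('._')
    safe_dir = re.sub(r'[^\w.-]+', '_', directory).strip('._')
    if not safe_name:
        # No usable name: fall back to the SHA1 hex digest of the URL
        safe_name = hashlib.sha1(url.encode('utf-8')).hexdigest()
    # No usable directory: fall back to 'full'
    return '%s/%s.jpg' % (safe_dir or 'full', safe_name)


# Hypothetical values, used only for illustration
print(build_image_path('https://example.com/a.png', 'My Photo', 'cats'))
```

Inside file_path this would be called as build_image_path(request.url, request.meta.get('filename', ''), request.meta.get('directoryname', '')).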