How to handle image filename duplication in scrapy image download

Question

Scrapy uses SHA1 to generate image filenames. When two images end up with the same filename, the later one overwrites the earlier one, so an existing image file is lost. Is it possible to write extra code (e.g. an overriding class) to handle duplication, for instance by generating new filenames until no duplicate is found? If so, could you provide a code example?

--- old question: Does Scrapy check that filenames are unique across all image files under the images_store folder? Scrapy uses SHA1 to generate filenames while downloading images. SHA1 provides a good level of uniqueness but, probabilistically, there is still a chance of duplication.

SHA1 by definition doesn't guarantee uniqueness, so there is a chance of duplication. According to the [source code](https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/pipeline/images.py), Scrapy will simply overwrite the image if a file with the same name already exists. See also: http://stackoverflow.com/questions/5388781/how-safely-can-i-assume-unicity-of-a-part-of-sha1-hash and http://stackoverflow.com/questions/3060259/do-cryptographic-hashes-provide-really-unique-results. — alecxe, Jun 03 '13 at 06:37
@alecxe: thanks for the input. I have updated the question to "How to handle image filename duplication in scrapy image download" — Harry, Jun 03 '13 at 07:27

score 1 · Answer 1 · answered Jun 03 '13 at 21:41

I'm not sure this is the best solution, but you could write a custom pipeline based on ImagesPipeline and override its image_key method like this (I haven't tested it):

import hashlib
import os
import random
import string
from scrapy.contrib.pipeline.images import ImagesPipeline


class CustomImagesPipeline(ImagesPipeline):
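    def image_key(self, url):
        # The file name is the SHA1 hex digest of the image URL
        # (encoded so that sha1 receives bytes on Python 3).
        image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()

        # If a file with this name already exists, append a random letter
        # to the name and check again until the path is free.
        path_format = 'full/%s.jpg'
        while True:
            path = path_format % image_guid
            # The key is relative to IMAGES_STORE; self.store.basedir
            # assumes the filesystem store is in use.
            if os.path.exists(os.path.join(self.store.basedir, path)):
                image_guid = image_guid + random.choice(string.ascii_letters)
            else:
                break

        return path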

This is just an example; you may want to improve the filename-changing logic. You should also do the same for the thumb_key method.
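
The retry-until-unique idea can also be sketched outside Scrapy, which makes it easier to test. The helper below is hypothetical (unique_image_path and store_dir are illustrative names, not part of Scrapy). It assumes a local filesystem store and uses a numeric suffix instead of random letters so the result is deterministic:

```python
import hashlib
import os


def unique_image_path(store_dir, url):
    """Return a relative 'full/<sha1>[-N].jpg' path not yet present under store_dir.

    Hypothetical helper for illustration; not a Scrapy API.
    """
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    candidate = "full/%s.jpg" % image_guid
    suffix = 0
    # Keep trying new names until one is free on disk.
    while os.path.exists(os.path.join(store_dir, candidate)):
        suffix += 1
        candidate = "full/%s-%d.jpg" % (image_guid, suffix)
    return candidate
```

One consequence of this design: the helper cannot tell a genuine hash collision from a second download of the same URL, so a repeated URL also gets a new name.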

Hope that helps.

score 0 · Answer 2 · answered Nov 15 '16 at 03:44

You shouldn't care about it!

Scrapy names each file after the SHA1 of the image URL. To have a 50% probability of finding a SHA1 collision you need about 2^80 items. So unless you are going to crawl on the order of 2^80 images, the chance of image filename duplication is less than 50%. In fact, you can crawl far more than 1 trillion images and simply ignore filename duplication, because the chance is insignificant.
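
To put rough numbers on this, here is a small sketch using the standard birthday approximation, assuming SHA1 outputs behave like uniformly random 160-bit values. The function name collision_probability is made up for this illustration:

```python
import math

SHA1_BITS = 160


def collision_probability(n_items, bits=SHA1_BITS):
    """Approximate chance of at least one collision among n_items random hashes.

    Birthday approximation: p ~= 1 - exp(-n^2 / (2 * 2^bits)).
    """
    exponent = (n_items ** 2) / (2 * 2 ** bits)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-exponent)


print(collision_probability(10 ** 12))  # one trillion image URLs
print(collision_probability(2 ** 80))   # the "about 2^80" ballpark
```

Under this approximation, one trillion URLs give a probability on the order of 10^-25, while exactly 2^80 items give roughly 39%. The 50% mark falls near 1.18 × 2^80, which is consistent with "about 2^80".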
