Scrapy uses SHA1 to generate image filenames. When two filenames collide, the new download overwrites the existing file, so an image is lost. Is it possible to write extra code (e.g. an overriding class) to handle duplicates, for instance by generating new filenames until no duplicate is found? If so, could you provide a code example?

--- Old question: Does Scrapy check that filenames are unique across all image files under the images_store folder? Scrapy uses SHA1 to generate filenames while downloading images. SHA1 provides a good level of uniqueness, but there is still a small probability of duplication.

Harry
  • SHA1 by definition doesn't guarantee uniqueness, so there is a chance of duplication. According to the [source code](https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/pipeline/images.py), Scrapy will simply overwrite an image if one with the same name already exists. See also: http://stackoverflow.com/questions/5388781/how-safely-can-i-assume-unicity-of-a-part-of-sha1-hash and http://stackoverflow.com/questions/3060259/do-cryptographic-hashes-provide-really-unique-results. – alecxe Jun 03 '13 at 06:37
  • @alecxe: thanks for the input. I have updated the question to "How to handle image filename duplication in scrapy image download" – Harry Jun 03 '13 at 07:27

2 Answers

Not sure this is the best solution, but you could write a custom pipeline based on ImagesPipeline and override its image_key method like this (untested):

import hashlib
import os
import random
import string
from scrapy.contrib.pipeline.images import ImagesPipeline


class CustomImagesPipeline(ImagesPipeline):
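    def image_key(self, url):
        # sha1 needs bytes, so encode the URL first
        image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()

        # if a file with this name already exists, append a random letter and retry
        path_format = 'full/%s.jpg'
        while True:
            path = path_format % image_guid
            # keys are relative to IMAGES_STORE, so check under the store directory
            # (assumption: the filesystem store, which exposes basedir)
            if os.path.exists(os.path.join(self.store.basedir, path)):
                image_guid = image_guid + random.choice(string.ascii_letters)
            else:
                break

        return path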

This is just an example - you may want to improve that filename change logic. Additionally, you should do the same for thumb_key method.
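One possible improvement to the renaming logic, as a rough sketch only: use a numeric suffix instead of random letters, so the result is deterministic. The helper name `unique_image_key` and its `basedir` and `ext` parameters are made up for illustration and are not part of Scrapy; it uses only the standard library and assumes the images live on the local filesystem.

```python
import hashlib
import os


def unique_image_key(basedir, url, ext=".jpg"):
    """Return a 'full/<name><ext>' key that does not yet exist under basedir.

    Hypothetical helper, not a Scrapy API.
    """
    guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    key = "full/%s%s" % (guid, ext)
    counter = 1
    # Append -1, -2, ... to the hash until the path is free.
    while os.path.exists(os.path.join(basedir, key)):
        key = "full/%s-%d%s" % (guid, counter, ext)
        counter += 1
    return key
```

Keep in mind that the name is derived from the URL, so an existing file with the same name almost always means the same URL was downloaded before. In that case this approach stores a second copy rather than rescuing a different image.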

Hope that helps.

alecxe

You shouldn't care about it!

Scrapy uses the SHA1 of the image URL. To have a 50% probability of finding a SHA1 collision you need about 2^80 items. So, unless you are going to crawl 2^80 images, the chance of an image filename duplication is less than 50%. In fact, you can crawl much more than 1 trillion images and simply ignore filename duplication, because the chances are insignificant.
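To put rough numbers on this, here is a small sketch using the standard birthday-bound approximation p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for a b-bit hash. The function name `collision_probability` is made up for illustration, and the formula assumes SHA1 outputs behave like uniformly random 160-bit values.

```python
import math


def collision_probability(n_items, hash_bits=160):
    """Approximate chance that at least two of n_items share a hash value."""
    exponent = -n_items * (n_items - 1) / 2.0 ** (hash_bits + 1)
    # -expm1(x) equals 1 - exp(x) but stays accurate when x is tiny.
    return -math.expm1(exponent)


print(collision_probability(10 ** 12))  # one trillion images
print(collision_probability(2 ** 80))   # near the birthday bound
```

Under this approximation, one trillion images gives a probability far below anything that matters in practice, while 2^80 items lands in the tens of percent, which matches the "about 2^80" rule of thumb.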

Djunzu