6

I'm building a website that will store millions of images, so I need a unique ID for each image. Which cryptographic hash is best for this? Right now I'm using SHA1; my code is shown below.

Is there a standard hash used besides SHA1, and is it possible that two images could have the same hash code?

 Image img = Image.FromFile("image.jpg");

 ImageConverter converter = new ImageConverter();
 byte[] byteArray = (byte[])converter.ConvertTo(img, typeof(byte[]));

 string hash;

 using (SHA1CryptoServiceProvider sha1 = new SHA1CryptoServiceProvider())
 {
     hash = Convert.ToBase64String(sha1.ComputeHash(byteArray));
 }
Leslie Jones
  • 7
    If you just need to assign a unique identifier to the image, why not a GUID? – David Mar 08 '15 at 00:47
  • 1
    @David I assume the OP wants unique images stored. Wouldn't make sense to have two exact same files on the server with different names. – BrunoLM Mar 08 '15 at 00:49
  • 2
    That's correct, I want unique images stored. – Leslie Jones Mar 08 '15 at 00:50
  • 1
    Is it possible that 2 valid images have the same hash code, yes. Is it remotely likely, no. – Scott Chamberlain Mar 08 '15 at 00:54
  • This http://stackoverflow.com/a/2480819/340760 and this http://stackoverflow.com/a/1867252/340760 . In short the probability of collision is pretty low. I've used this mechanism in a website but I had only about 20k images. – BrunoLM Mar 08 '15 at 00:56
  • Do you have security concerns of some sort in relation to these IDs? If not, then anything, even MD5 or SHA1, is perfectly fine. Otherwise start with SHA256. Side note: please do not use "encryption" as a term for "hash function"; these are really different and not directly related concepts. – Alexei Levenkov Mar 08 '15 at 01:00
  • What kind of website is this and what makes you think it's possible to force uniqueness of images by using a hash? You'd have to use much more sophisticated algorithms for comparing images and even then I wouldn't rely on that 100%... – walther Mar 08 '15 at 01:03
  • It's a friendship website where users upload profile images. I want to make sure the images are stored with unique IDs so they don't overwrite each other. – Leslie Jones Mar 08 '15 at 01:05
  • 2
    @LeslieJones If you only care about unique file name then use GUID. – Brian Mar 08 '15 at 01:10
  • Okay, thanks, I think I'm going to do that :) – Leslie Jones Mar 08 '15 at 01:15
  • Depending on your expected number of items, just use any fast hash algorithm with a good length (the more items you have, the longer the hash should be, e.g. CRC32 might not be a good choice). – poke Mar 08 '15 at 01:15
  • As others have said, if you want to uniquely identify and rename the image for disk storage, a GUID would be ideal. If you do go with a GUID, also look into what options your data store has for GUIDs. For example, MS SQL has a `NEWSEQUENTIALID()` function that is well suited to primary keys because it generates values in an index-friendly order. Also note that hashing is a one-way operation: the original data can't be recovered from the hash. You can only re-hash and compare the hashes. – Justin Mar 08 '15 at 01:21
  • While a hashing function will help eliminate **identical** images, it will not do anything against **similar** images. Consider what happens, for example, when someone saves a JPG as a PNG and uploads both images. Or they resize the image and upload both. If these are issues you need to be concerned about then you should take a look at histograms. – Sam Axe Mar 08 '15 at 05:46

3 Answers

7

If I understand correctly, you want to use the SHA1 value as the filename so you can detect whether an image is already in your collection. I don't think this is the best approach (although if you're not running a database, maybe it is). Still, even if you're planning to have millions of images, for practical purposes you can treat collisions as impossible.

For this purpose I would not recommend SHA256: its two main advantages (stronger collision resistance and immunity to some theoretical attacks) are not really worth the cost here, because it is slower than SHA1 (the exact factor depends on the implementation and hardware) and you'll be hashing a lot of fairly big files.

You shouldn't be worried about its length: SHA1 produces a 160-bit digest, so for a 50% chance of a collision you would need on the order of 2^80 ≈ 1.2 × 10^24 images in your collection (the square root of 2^160). Even with a 128-bit hash such as MD5 the figure is about 2^64 ≈ 1.8 × 10^19.
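As a rough sanity check, the standard birthday approximation p ≈ 1 − exp(−n² / 2^(b+1)) can be evaluated directly for n images and a b-bit hash. The following is only a sketch: the class and method names are made up for illustration, and it assumes hash outputs behave like uniformly random values.

```csharp
using System;

public static class BirthdayBound
{
    // Approximate probability that at least two of n uniformly random
    // b-bit hashes collide: p ≈ 1 - exp(-n^2 / 2^(b+1)).
    public static double CollisionProbability(double n, int bits)
    {
        double x = n * n / Math.Pow(2.0, bits + 1);
        // For tiny x, 1 - exp(-x) rounds to 0 in double precision, so use p ≈ x.
        return x < 1e-8 ? x : 1.0 - Math.Exp(-x);
    }

    // Approximate number of items needed for a 50% collision chance:
    // n ≈ sqrt(2 * ln 2) * 2^(b/2), i.e. about 1.18 times the square root of 2^b.
    public static double ItemsForHalfChance(int bits)
    {
        return Math.Sqrt(2.0 * Math.Log(2.0)) * Math.Pow(2.0, bits / 2.0);
    }

    public static void Main()
    {
        Console.WriteLine(CollisionProbability(1e7, 160)); // ten million images, 160-bit hash (SHA1)
        Console.WriteLine(CollisionProbability(1e7, 128)); // ten million images, 128-bit hash (MD5)
        Console.WriteLine(ItemsForHalfChance(160));        // items needed for a 50% chance
    }
}
```

The commonly quoted "square root of the hash space" figure is the same order of magnitude as what `ItemsForHalfChance` returns; either way, a collection of millions of images is many orders of magnitude below it.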

Oh, and I don't wanna sound conceited or anything, but hash and cryptography are two different things. In fact, I'd say that hashing is closer to code signing/digital signatures than to cryptography.

Gaspa79
4

You can use both mechanisms.

  1. Use a GUID as a unique file identifier (file system, database, etc.)
  2. Calculate and store an SHA1 or MD5 hash on your image and use that to check for duplicates.

So when an image is uploaded, you can use the hash to check for a possible duplicate. If one is found, you can then do a definitive check (i.e. compare the bytes of the files). Realistically speaking, you will probably never get a hash match without the files being the same, but this second check will tell you for sure.

Then, once uniqueness is determined, use the GUID for the file identifier or reuse the existing file.
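The flow described above could be sketched roughly as follows. This is a minimal in-memory illustration, not a definitive implementation: `ImageStore` and `AddOrGetExisting` are hypothetical names, and the two dictionaries stand in for the file system and an indexed hash column in a database.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;

// In-memory stand-in for the file system / database; purely illustrative.
public class ImageStore
{
    // GUID -> file bytes (stands in for files on disk).
    private readonly Dictionary<Guid, byte[]> _files = new Dictionary<Guid, byte[]>();

    // Hex hash -> GUIDs of stored files with that hash (stands in for an indexed DB column).
    private readonly Dictionary<string, List<Guid>> _byHash = new Dictionary<string, List<Guid>>();

    public int Count => _files.Count;

    // Returns the GUID of the stored image, reusing the existing one for byte-identical uploads.
    public Guid AddOrGetExisting(byte[] imageBytes)
    {
        string hash = ComputeHash(imageBytes);

        if (_byHash.TryGetValue(hash, out List<Guid> candidates))
        {
            // A hash match is only a hint; confirm by comparing the actual bytes.
            foreach (Guid id in candidates)
            {
                if (_files[id].SequenceEqual(imageBytes))
                    return id;
            }
        }
        else
        {
            candidates = new List<Guid>();
            _byHash[hash] = candidates;
        }

        // No identical file found: store it under a fresh GUID.
        Guid newId = Guid.NewGuid();
        _files[newId] = imageBytes;
        candidates.Add(newId);
        return newId;
    }

    private static string ComputeHash(byte[] data)
    {
        using (SHA1 sha1 = SHA1.Create())
            return BitConverter.ToString(sha1.ComputeHash(data)).Replace("-", "");
    }
}
```

Calling `AddOrGetExisting` twice with the same bytes returns the same GUID, while different bytes get a new one. For large files you would probably hash and compare streams rather than load whole byte arrays into memory.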

Matt Houser
3

Can two different images have the same hash code? Unlikely. On the other hand, can two copies of the same image have different hashes? Absolutely.

Take a lossless png, open it, and resave it as uncompressed. The pixels of both images will be identical, but the file hashes will be different.

Aside from the pixels, your images will also contain metadata fields such as geolocation, date/time, camera maker, camera model, ISO speed, focal length, etc.

So your hash will be affected by the type of compression and metadata when using the image file in its entirety.

The main question here is: What makes a picture "unique" to you?

For example, if an image is already uploaded, and I download it, wipe out the camera model or comments, and re-upload it, would it be a different image to you, or is it still the same as the original? How about the location field?

What if I download a lossless png and save it as a lossless tiff which will have the same pixel data?

Based on your requirements and which fields are important, you'll need to create a hash of the combination of the relevant metadata fields (if any) + the actual uncompressed pixel data of the image instead of making a hash using an image file in its entirety.

Of the standard hash algorithms provided in System.Security.Cryptography you'll probably find MD5 to be best suited to this application. But by all means play around with the different ones and see which one works best for you.

Here's a code sample that gets you a hash for the combination of metadata fields and image pixels:

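// Requires a reference to System.Drawing (System.Drawing.Common on .NET Core / .NET 5+).
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Runtime.InteropServices;
using System.Security.Cryptography;
using System.Text;

public class ImageHash
{
    // EXIF tag ID for the camera manufacturer (PropertyTagEquipMake).
    private const int EquipMakeTagId = 0x010F;

    public string GetHash(string filePath)
    {
        using (var image = (Bitmap) Image.FromFile(filePath))
            return GetHash(image);
    }

    public string GetHash(Bitmap bitmap)
    {
        using (var memoryStream = new MemoryStream())
        {
            // Write the selected metadata fields first, then the raw pixels.
            // BinaryWriter is used here instead of the obsolete BinaryFormatter.
            using (var writer = new BinaryWriter(memoryStream, Encoding.UTF8, leaveOpen: true))
            {
                foreach (var field in GetMetaFields(bitmap))
                {
                    writer.Write(field.Key);
                    writer.Write(field.Value);
                }
            }

            var pixelBytes = GetPixelBytes(bitmap);
            memoryStream.Write(pixelBytes, 0, pixelBytes.Length);

            using (var hashAlgorithm = GetHashAlgorithm())
            {
                // Rewind so the hash covers everything written so far.
                memoryStream.Seek(0, SeekOrigin.Begin);
                var hash = hashAlgorithm.ComputeHash(memoryStream);
                return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
            }
        }
    }

    private static HashAlgorithm GetHashAlgorithm() => MD5.Create();

    private static byte[] GetPixelBytes(Bitmap bitmap, PixelFormat pixelFormat = PixelFormat.Format32bppRgb)
    {
        var lockedBits = bitmap.LockBits(new Rectangle(0, 0, bitmap.Width, bitmap.Height), ImageLockMode.ReadOnly, pixelFormat);

        try
        {
            var bufferSize = lockedBits.Height * lockedBits.Stride;
            var buffer = new byte[bufferSize];
            Marshal.Copy(lockedBits.Scan0, buffer, 0, bufferSize);
            return buffer;
        }
        finally
        {
            // Always release the lock, even if the copy fails.
            bitmap.UnlockBits(lockedBits);
        }
    }

    private static IEnumerable<KeyValuePair<string,string>> GetMetaFields(Image image)
    {
        // Only read the tag if the image actually carries it; GetPropertyItem throws otherwise.
        if (Array.IndexOf(image.PropertyIdList, EquipMakeTagId) >= 0)
        {
            string manufacturer = Encoding.ASCII.GetString(image.GetPropertyItem(EquipMakeTagId).Value).TrimEnd('\0');
            yield return new KeyValuePair<string, string>("manufacturer", manufacturer);
        }

        // return any other fields you may be interested in
    }
}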

And obviously, you'd use this as:

var hash = new ImageHash().GetHash(@"some file path");

Whilst a decent start, this method has areas that can be improved on, such as:

  1. How about the same image after being resized? If that doesn't make it a different picture (as in, if you need tolerance to image resize), then you'll want to resize the input images first to a pre-determined size before hashing.

  2. How about changes in ambient light? Would that make it a different picture? If the answer is no, then you'll need to take that into account too and make the algorithm robust to brightness changes, so that it still produces the same hash when the image brightness has changed.

  3. How about geometric transformations? e.g., if I rotate or mirror an image before re-uploading it, is it still the same image as the original? If so, the algorithm would need to be intelligent enough to produce the same hash after those types of transformations.

  4. How would you like to handle cases where a border is added to an image? There are many such scenarios in the realm of image processing. Some of them have fairly standard solutions, while many others are still being actively worked on.

  5. Performance: this current code may prove time and resource consuming depending on the number & size of images and how much time you can afford to spend on the hashing of each image. If you need it to run faster and/or use up less memory, you may want to downsize your images to a pre-determined size before getting their hash.

Daria