Can two different images have the same hash code? Unlikely. On the other hand, can two copies of the same image have different hashes? Absolutely.
Take a PNG, open it, and resave it with a different compression level. The pixels of both images will be identical, but the file hashes will be different.
Aside from the pixels, your images will also contain metadata fields such as geolocation, date/time, camera maker, camera model, ISO speed, focal length, etc.
So if you hash the image file in its entirety, the result depends on the compression settings and the metadata, not just on the pixels.
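To make the compression point concrete, here is a minimal sketch. It does not read real image files: it assumes a raw pixel buffer and uses `DeflateStream` (the same family of compression PNG relies on) at two compression levels as a stand-in for "the same image saved twice". The class and method names are made up for illustration.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Security.Cryptography;

public static class CompressionHashDemo
{
    // Compresses a buffer at the given level; stands in for "saving the image".
    public static byte[] Deflate(byte[] data, CompressionLevel level)
    {
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, level, leaveOpen: true))
                deflate.Write(data, 0, data.Length);
            return output.ToArray();
        }
    }

    // Decompresses a buffer; stands in for "decoding the image back to pixels".
    public static byte[] Inflate(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return output.ToArray();
        }
    }

    // Returns the MD5 digest of a buffer as a lowercase hex string.
    public static string Md5Hex(byte[] data)
    {
        using (var md5 = MD5.Create())
            return BitConverter.ToString(md5.ComputeHash(data)).Replace("-", "").ToLowerInvariant();
    }
}
```

Hashing `Deflate(pixels, CompressionLevel.NoCompression)` and `Deflate(pixels, CompressionLevel.Optimal)` gives two different digests, while hashing the result of `Inflate` on either one gives the same digest, because the decoded pixels are identical.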
The main question here is: What makes a picture "unique" to you?
For example, if an image is already uploaded, and I download it, wipe out the camera model or comments, and re-upload it, would it be a different image to you, or is it still the same as the original? How about the location field?
What if I download a PNG and save it as a losslessly compressed TIFF, which will have the same pixel data?
Based on your requirements and which fields are important, you'll need to create a hash of the combination of the relevant metadata fields (if any) + the actual uncompressed pixel data of the image instead of making a hash using an image file in its entirety.
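Of the standard hash algorithms provided in System.Security.Cryptography, you'll probably find MD5 to be best suited to this application: it is fast, and a 128-bit digest is plenty for spotting accidental duplicates. Keep in mind that MD5 is not collision-resistant against deliberate attacks, so if uploads come from untrusted users, SHA-256 is the safer choice. By all means play around with the different ones and see which one works best for you.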
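If you want to compare algorithms on your own data, a rough sketch like the following may help. The `HashComparison` class and its members are made-up names for illustration, the list of candidates is just an example, and a single `Stopwatch` measurement is only a crude indication, not a proper benchmark.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Security.Cryptography;

public static class HashComparison
{
    // Candidate algorithms keyed by display name; each factory creates a fresh instance.
    public static readonly Dictionary<string, Func<HashAlgorithm>> Candidates =
        new Dictionary<string, Func<HashAlgorithm>>
        {
            { "MD5", () => MD5.Create() },
            { "SHA-1", () => SHA1.Create() },
            { "SHA-256", () => SHA256.Create() },
        };

    // Formats a digest as a lowercase hex string.
    public static string ToHex(byte[] bytes) =>
        BitConverter.ToString(bytes).Replace("-", "").ToLowerInvariant();

    // Hashes the same buffer with every candidate and prints digest size, time and digest.
    public static void Compare(byte[] data)
    {
        foreach (var candidate in Candidates)
        {
            using (var algorithm = candidate.Value())
            {
                var stopwatch = Stopwatch.StartNew();
                var digest = algorithm.ComputeHash(data);
                stopwatch.Stop();
                Console.WriteLine(
                    $"{candidate.Key}: {digest.Length * 8} bits, " +
                    $"{stopwatch.Elapsed.TotalMilliseconds:F2} ms, {ToHex(digest)}");
            }
        }
    }
}
```

Call `HashComparison.Compare(buffer)` with a buffer that is representative of your images (for instance, the uncompressed pixel bytes of a typical upload) and run it a few times before drawing conclusions.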
Here's a code sample that gets you a hash for the combination of metadata fields and image pixels:
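    using System;
    using System.Collections.Generic;
    using System.Drawing;
    using System.Drawing.Imaging;
    using System.IO;
    using System.Runtime.InteropServices;
    using System.Security.Cryptography;
    using System.Text;

    public class ImageHash
    {
        // EXIF tag 0x010F holds the camera manufacturer ("Make").
        private const int ManufacturerTagId = 0x010F;

        public string GetHash(string filePath)
        {
            using (var image = (Bitmap) Image.FromFile(filePath))
                return GetHash(image);
        }

        public string GetHash(Bitmap bitmap)
        {
            using (var memoryStream = new MemoryStream())
            {
                // BinaryWriter is used instead of BinaryFormatter, which is obsolete and insecure.
                // Strings are written length-prefixed, so field boundaries stay unambiguous.
                using (var writer = new BinaryWriter(memoryStream, Encoding.UTF8, leaveOpen: true))
                {
                    foreach (var field in GetMetaFields(bitmap))
                    {
                        writer.Write(field.Key);
                        writer.Write(field.Value);
                    }
                    // Include the dimensions so that images with the same bytes but a different shape differ.
                    writer.Write(bitmap.Width);
                    writer.Write(bitmap.Height);
                }

                var pixelBytes = GetPixelBytes(bitmap);
                memoryStream.Write(pixelBytes, 0, pixelBytes.Length);

                using (var hashAlgorithm = GetHashAlgorithm())
                {
                    memoryStream.Seek(0, SeekOrigin.Begin);
                    var hash = hashAlgorithm.ComputeHash(memoryStream);
                    return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
                }
            }
        }

        private static HashAlgorithm GetHashAlgorithm() => MD5.Create();

        // Copies the decoded pixels into a byte array, converting them to a common pixel format first.
        private static byte[] GetPixelBytes(Bitmap bitmap, PixelFormat pixelFormat = PixelFormat.Format32bppRgb)
        {
            var lockedBits = bitmap.LockBits(new Rectangle(0, 0, bitmap.Width, bitmap.Height), ImageLockMode.ReadOnly, pixelFormat);
            try
            {
                var bufferSize = lockedBits.Height * lockedBits.Stride;
                var buffer = new byte[bufferSize];
                Marshal.Copy(lockedBits.Scan0, buffer, 0, bufferSize);
                return buffer;
            }
            finally
            {
                // Always unlock, even if the copy throws.
                bitmap.UnlockBits(lockedBits);
            }
        }

        private static IEnumerable<KeyValuePair<string, string>> GetMetaFields(Image image)
        {
            // Look the tag up by ID: a fixed index into PropertyItems is not guaranteed to be the manufacturer.
            if (Array.IndexOf(image.PropertyIdList, ManufacturerTagId) >= 0)
            {
                var rawValue = image.GetPropertyItem(ManufacturerTagId).Value;
                // EXIF ASCII values are null-terminated, so strip the trailing '\0'.
                var manufacturer = Encoding.ASCII.GetString(rawValue).TrimEnd('\0');
                yield return new KeyValuePair<string, string>("manufacturer", manufacturer);
            }
            // return any other fields you may be interested in
        }
    }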
You would then use it like this:
    var hash = new ImageHash().GetHash(@"some file path");
Whilst a decent start, this method has areas that can be improved on, such as:
How about the same image after being resized? If that doesn't make it a different picture (as in, if you need tolerance to image resize), then you'll want to resize the input images first to a pre-determined size before hashing.
How about changes in ambient light? Would that make it a different picture? If the answer is no, then you'll need to take that into account too and make the algorithm robust to brightness changes, so that it still produces the same hash when only the brightness has changed.
How about geometric transformations? For example, if I rotate or mirror an image before re-uploading it, is it still the same image as the original? If so, the algorithm would need to be intelligent enough to produce the same hash after those types of transformations.
How would you like to handle cases where a border is added to an image? There are many such scenarios in the realm of image processing. Some of these have fairly standard solutions, while many others are still being actively worked on.
Performance: the current code may prove time- and resource-consuming, depending on the number and size of your images and how much time you can afford to spend hashing each one. If you need it to run faster or use less memory, you may want to downsize your images to a predetermined size before hashing them.
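As one possible direction for the resize and brightness points, here is a minimal sketch of an "average hash", a simple perceptual hash. This is a different technique from the MD5 code above, not a drop-in extension of it: it yields a 64-bit fingerprint that you compare by Hamming distance rather than by equality. The sketch assumes you have already converted the image to a grayscale `byte[,]` (rows by columns); that conversion and the class name `AverageHash` are assumptions for illustration. It does not handle rotation, mirroring, or added borders.

```csharp
using System;

public static class AverageHash
{
    // Shrinks a grayscale image (rows x columns, values 0-255) to size x size
    // by averaging the pixels that fall into each cell.
    public static double[,] Shrink(byte[,] gray, int size)
    {
        int height = gray.GetLength(0), width = gray.GetLength(1);
        var cells = new double[size, size];
        for (int cy = 0; cy < size; cy++)
        {
            for (int cx = 0; cx < size; cx++)
            {
                // Pixel range covered by this cell; always at least one pixel wide and tall.
                int y0 = cy * height / size, y1 = Math.Max(y0 + 1, (cy + 1) * height / size);
                int x0 = cx * width / size, x1 = Math.Max(x0 + 1, (cx + 1) * width / size);
                double sum = 0;
                for (int y = y0; y < y1; y++)
                    for (int x = x0; x < x1; x++)
                        sum += gray[y, x];
                cells[cy, cx] = sum / ((y1 - y0) * (x1 - x0));
            }
        }
        return cells;
    }

    // Builds a 64-bit fingerprint: one bit per cell of an 8x8 thumbnail,
    // set when the cell is brighter than the thumbnail's mean.
    public static ulong Compute(byte[,] gray)
    {
        const int size = 8;
        var cells = Shrink(gray, size);

        double mean = 0;
        foreach (var value in cells) mean += value;
        mean /= size * size;

        ulong hash = 0;
        foreach (var value in cells)
            hash = (hash << 1) | (value > mean ? 1UL : 0UL);
        return hash;
    }

    // Counts the bits that differ between two fingerprints.
    public static int HammingDistance(ulong a, ulong b)
    {
        ulong diff = a ^ b;
        int count = 0;
        while (diff != 0) { diff &= diff - 1; count++; }
        return count;
    }
}
```

Because every image is reduced to the same 8x8 grid, a resized copy tends to land on the same or a nearby fingerprint, and because each cell is compared against the mean, a uniform brightness shift mostly cancels out. You would treat two images as duplicates when the Hamming distance falls below a threshold you choose by experimenting on your own data.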