I am new to this, and my objective is to build a web application that lets users store images in a database. I want to avoid storing the same image twice or more.

So what I need is a way to find duplicate or similar images that are already stored in the database. Even better, when a user tries to import an image, the system would check whether it is similar to one already stored and warn the user not to store it.

I just want to work out how to find similar or duplicate images in a specific directory or in a database. Can you explain from the beginning how to build this, and what I should learn to accomplish it, starting from the basics, like a tutorial? I'd like to learn as much as possible.

Thanks in advance. I really need this help.

AdityaSetyadi

1 Answer

Finding similar images is much more complex, so I will stick to finding duplicate images first. The easiest thing to do is to take a SHA1 hash of the image bytes; two files with identical bytes always produce the same hash. Here is some C# code to accomplish this (see below). For storing the hash in a database, I recommend a binary(20) datatype, since a SHA1 hash is exactly 20 bytes. This allows your SQL server to index and query much faster than if the hash were stored as a string or in some other format.

using System;
using System.IO;
using System.Security.Cryptography;

// Returns the 20-byte SHA1 hash of the first 3,840,000 bytes of the file.
// Files that differ only after that point will produce the same hash.
private static byte[] GetHashCodeForFile(string file)
{
    const int maxNumberOfBytesToUse = 3840000;

    using (Stream sr = File.OpenRead(file))
    using (SHA1 hasher = SHA1.Create())
    {
        int bytesToReadIn = (int)Math.Min(sr.Length, maxNumberOfBytesToUse);
        byte[] buffer = new byte[bytesToReadIn];

        // Stream.Read may return fewer bytes than requested, so loop until done.
        int offset = 0;
        while (offset < bytesToReadIn)
        {
            int read = sr.Read(buffer, offset, bytesToReadIn - offset);
            if (read == 0) break; // end of stream reached early
            offset += read;
        }

        return hasher.ComputeHash(buffer, 0, offset);
    }
}
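
To warn at import time, the hash can be checked against the hashes already stored before the new image is saved. The following is only a minimal sketch under assumptions that are not part of the answer above: the `DuplicateDetector` class and its `TryRegister` method are hypothetical names, an in-memory `HashSet` stands in for the indexed binary(20) column, and the whole upload is hashed rather than only its first bytes.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

// Hypothetical helper: an in-memory stand-in for a database table
// with an indexed binary(20) hash column.
public sealed class DuplicateDetector
{
    // Hex strings are used as keys because byte[] compares by reference,
    // not by content.
    private readonly HashSet<string> knownHashes = new HashSet<string>();

    // Returns true if the image is new (and records its hash),
    // false if a byte-identical image was already registered.
    public bool TryRegister(byte[] imageBytes)
    {
        using (SHA1 hasher = SHA1.Create())
        {
            byte[] hash = hasher.ComputeHash(imageBytes);
            string key = BitConverter.ToString(hash);
            return knownHashes.Add(key);
        }
    }
}
```

In a real application the lookup would be a query against the hash column, ideally backed by a unique index, so that the check survives restarts and concurrent uploads. This only catches byte-identical files: a re-saved or resized copy of the same picture gets a different hash.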

Searching for similar images is a difficult problem and an active area of research, and the right approach depends on how you define "similar". Some prominent methods for finding similar images are:

  • Check the metadata (EXIF or similar) tags in the image file for the creation date; similar images are often taken at times close to each other. This may not be the best fit for what you want.
  • Calculate the relative histogram of both images and compare the deltas in each color channel. This has the benefit that the comparison can be written as an SQL query, and it is invariant to image size: an image that has been converted to a thumbnail can still be found with this method.
  • Perform an image subtraction between the two images and see how close the result gets to pure black (all zeros). I don't know of a way to do this with a T-SQL query, and the code can get tricky with images that need to be resized.
  • Calculate the contours of each image (with Sobel, Canny, or another edge detector), then subtract the two results to see how much of their contours overlap. Again, I don't think this can be handled in SQL.
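
The histogram and subtraction methods from the list above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the `ImageSimilarity` class and its method names are hypothetical, and the pixel data is assumed to be already decoded into a raw buffer laid out as R,G,B,R,G,B,... with one byte per channel.

```csharp
using System;

// Hypothetical helpers operating on raw RGB pixel buffers. Decoding an
// image file into this layout is left to whichever imaging library you use.
public static class ImageSimilarity
{
    // Relative histogram: for each channel, the fraction of pixels at each
    // of the 256 intensity levels. Dividing by the pixel count is what
    // makes the result independent of image size.
    public static double[][] RelativeHistogram(byte[] rgb)
    {
        if (rgb.Length == 0 || rgb.Length % 3 != 0)
            throw new ArgumentException("Expected a non-empty RGB buffer.", nameof(rgb));

        int pixelCount = rgb.Length / 3;
        var histogram = new double[3][];
        for (int c = 0; c < 3; c++) histogram[c] = new double[256];

        // Count occurrences per channel, then normalize to fractions.
        for (int i = 0; i < rgb.Length; i++) histogram[i % 3][rgb[i]] += 1;
        for (int c = 0; c < 3; c++)
            for (int bin = 0; bin < 256; bin++)
                histogram[c][bin] /= pixelCount;

        return histogram;
    }

    // Sum of absolute bin differences over all channels, scaled to [0, 1]:
    // 0 means identical histograms, 1 means no overlap in any channel.
    public static double HistogramDistance(double[][] a, double[][] b)
    {
        double total = 0;
        for (int c = 0; c < 3; c++)
            for (int bin = 0; bin < 256; bin++)
                total += Math.Abs(a[c][bin] - b[c][bin]);
        return total / 6.0; // each of the 3 channels contributes at most 2
    }

    // Image subtraction: mean absolute per-byte difference, scaled to [0, 1].
    // 0 means the difference image is pure black. Both buffers must have
    // the same dimensions, so resize beforehand if necessary.
    public static double MeanAbsoluteDifference(byte[] rgbA, byte[] rgbB)
    {
        if (rgbA.Length == 0 || rgbA.Length != rgbB.Length)
            throw new ArgumentException("Buffers must be non-empty and the same size.");

        long sum = 0;
        for (int i = 0; i < rgbA.Length; i++)
            sum += Math.Abs(rgbA[i] - rgbB[i]);
        return sum / (255.0 * rgbA.Length);
    }
}
```

How small a distance has to be before two images count as "similar" is a threshold you would have to tune on your own images. Because the histogram is a fixed set of 768 numbers per image, it can be stored alongside the image and compared in a query, whereas the subtraction needs both pixel buffers in memory.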
nhahtdh
Matt Johnson