47

I am trying to write a function to determine whether two equal-size bitmaps are identical or not. The function I have right now simply compares a pixel at a time in each bitmap, returning false at the first non-equal pixel.

While this works, and works well for small bitmaps, in production I'm going to be using this in a tight loop and on larger images, so I need a better way. Does anyone have any recommendations?

The language I'm using is C# by the way - and yes, I am already using the .LockBits method. =)

Edit: I've coded up implementations of some of the suggestions given, and here are the benchmarks. The setup: two identical (worst-case) bitmaps, 100x100 in size, with 10,000 iterations each. Here are the results:

CompareByInts (Marc Gravell) :   1107ms
CompareByMD5  (Skilldrick)   :   4222ms
CompareByMask (GrayWizardX)  :    949ms

In CompareByInts and CompareByMask I'm using pointers to access the memory directly; in the MD5 method I'm using Marshal.Copy to retrieve a byte array and pass that as an argument to MD5.ComputeHash. CompareByMask is only slightly faster, but given the context I think any improvement is useful.

Thanks everyone. =)

Edit 2: Forgot to turn optimizations on - doing that gives GrayWizardX's answer even more of a boost:

CompareByInts   (Marc Gravell) :    944ms
CompareByMD5    (Skilldrick)   :   4275ms
CompareByMask   (GrayWizardX)  :    630ms
CompareByMemCmp (Erik)         :    105ms

Interesting that the MD5 method didn't improve at all.

Edit 3: Posted my answer (MemCmp) which blew the other methods out of the water. o.O

Erik Forbes
  • 35,357
  • 27
  • 98
  • 122
  • 5
    Have you tested this with the "full size" bitmaps you are going to run through in production? Has it proven itself slow? Have you profiled your code on your production bitmaps to determine where it is slowing down? –  Jan 08 '10 at 22:30
  • Yes - the problem is that the worst case - scanning both bitmaps and determining that they are identical - is also the most common case. – Erik Forbes Jan 08 '10 at 22:36
  • 4
    Define "identical" in the context of bitmaps. Do you mean, binary identical, file-based (both are .png files, and the contents are identical)? Do you mean pixel-identical (could be different file formats, but the pixels are the same)? Do you mean perceptually identical (slightly different red channels are OK since the eyes of humans aren't that good to discern differences in the red channel anyway? – Lasse V. Karlsen Jan 08 '10 at 22:51
  • 1
    just a small tip. We (in college) used to do a 1D allocation for the image using pointers (in C) with one malloc instead of multiple allocations when having the image represented as 2D. – ram Jan 09 '10 at 00:40
  • I mean pixel-perfect identical. – Erik Forbes Jan 09 '10 at 19:24
  • 1
    Thanks for doing some science with your test cases. – Factor Mystic Oct 11 '15 at 22:07
  • See my answer in http://stackoverflow.com/questions/43289/comparing-two-byte-arrays-in-net slightly beats `memcmp`. – Mr Anderson Jun 23 '16 at 22:38

9 Answers9

45

Edit 8-31-12: per Joey's comment below, be mindful of the format of the bitmaps you compare. They may contain padding on the strides that render the bitmaps unequal, despite being equivalent pixel-wise. See this question for more details.


Reading this answer to a question regarding comparing byte arrays has yielded a MUCH FASTER method: using P/Invoke and the memcmp API call in msvcrt. Here's the code:

[DllImport("msvcrt.dll")]
private static extern int memcmp(IntPtr b1, IntPtr b2, long count);

public static bool CompareMemCmp(Bitmap b1, Bitmap b2)
{
    if ((b1 == null) != (b2 == null)) return false;
    if (b1.Size != b2.Size) return false;

    var bd1 = b1.LockBits(new Rectangle(new Point(0, 0), b1.Size), ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);
    var bd2 = b2.LockBits(new Rectangle(new Point(0, 0), b2.Size), ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);

    try
    {
        IntPtr bd1scan0 = bd1.Scan0;
        IntPtr bd2scan0 = bd2.Scan0;

        int stride = bd1.Stride;
        int len = stride * b1.Height;

        return memcmp(bd1scan0, bd2scan0, len) == 0;
    }
    finally
    {
        b1.UnlockBits(bd1);
        b2.UnlockBits(bd2);
    }
}
Community
  • 1
  • 1
Erik Forbes
  • 35,357
  • 27
  • 98
  • 122
  • 7
    Side note: [This does not necessarily work, because each stride may contain padding](http://stackoverflow.com/q/12205247/73070). – Joey Aug 31 '12 at 05:36
  • 1
    Ah, very interesting. I hadn't considered this possibility - although in practice for my application this isn't an issue as the bitmaps in question are formatted `Format32bppRgb`. Thanks for the hint! =) – Erik Forbes Aug 31 '12 at 11:46
  • It occurs to me that another way to deal with the stride padding issue would be to invoke memcmp on a line by line basis, and setting 'count' to the number of bytes per line minus the padding. Might take a little longer, but would reduce the number of false misses due to padding inconsistencies. – Erik Forbes Jul 29 '13 at 13:10
  • 1
    The question is what overhead a P/Invoke call per line has in that case. – Joey Jul 29 '13 at 13:24
  • I did some timing tests and found that, for how I'm comparing bitmaps, the above example is slower by a factor of 100 as compared to a direct comparison of each pixel using a `for` loop and `b1.GetPixel(x, y) == b2.GetPixel(x, y)`. I'm doing small bitmap comparisons (around 100 x 10 px), so the result might change for comparisons against larger bitmaps. – BenR Sep 04 '14 at 22:08
  • 3
    Note that the last parameter of `memcmp` is an `IntPtr`, not a `long`, because if the program is running at 32 bits it is 32 bits, at 64 bits it is 64 bits. Clearly then you have to `return memcmp(bd1scan0, bd2scan0, (IntPtr)len) == 0;` – xanatos Jul 23 '15 at 13:38
9

If you are trying to determine if they are 100% equal, you can invert one and add it to the other if its zero they are identical. Extending this using unsafe code, take 64 bits at a time as a long and do the math that way, any differences can cause an immediate fail.

If the images are not 100% identical (comparing png to jpeg), or if you are not looking for a 100% match then you have some more work ahead of you.

Good luck.

GrayWizardx
  • 19,561
  • 2
  • 30
  • 43
  • I am looking for 100% identical, so this method could work. Thanks =) – Erik Forbes Jan 08 '10 at 22:34
  • 1
    Isn't adding one pixel to the inverse of another and seeing if the result is zero equivalent or slower to comparing the two pixels and seeing if they're the same? Or am I missing something? – Skilldrick Jan 08 '10 at 22:34
  • He is using pointer arithmetic (i.e. unsafe code) and so he can treat the pointers as an array of fixed size types (i.e. longs) and do pure hardware math. – GrayWizardx Jan 08 '10 at 22:36
  • Also testing for zero is performs better than testing for any other constant, from what I recall. – Erik Forbes Jan 08 '10 at 22:42
  • 1
    You can't invert one and subtract it from the other. That would be like a - (-b). If a==b, then this will equal a*2. I have no problems understanding what you mean though, but you should get the method correct. – Lasse V. Karlsen Jan 08 '10 at 23:04
  • I think inversion is still OK, provided its all done with 8 bits. – ram Jan 09 '10 at 00:34
  • 32-bits works, and is fastest in this context - so I've marked it as the answer. I'll be posting benchmark results shortly. – Erik Forbes Jan 09 '10 at 19:25
  • By the way - the relevant comparison is: `if ((~cursor1[0] & cursor2[0]) != 0x00) return false;` - invert one, AND it to the other, and check the result. Alternatively I could XOR them together without inverting, and compare that result to zero. Either way it seems like the advantage this has over comparing for equality is that comparisons with zero are faster than comparing with any other constant. – Erik Forbes Jan 09 '10 at 19:46
8

Well, you're using .LockBits, so presumably you're using unsafe code. Rather than treating each row origin (Scan0 + y * Stride) as a byte*, consider treating it as an int*; int arithmetic is pretty quick, and you only have to do 1/4 as much work. And for images in ARGB you might still be talking in pixels, making the math simple.

Marc Gravell
  • 1,026,079
  • 266
  • 2,566
  • 2,900
  • I will be using ARGB images so this sounds like it might be the winner. I'll give it a shot and do some profiling. Well, some more profiling. – Erik Forbes Jan 08 '10 at 22:39
6

Could you take a hash of each and compare? It would be slightly probabilistic, but practically not.

Thanks to Ram, here's a sample implementation of this technique.

Skilldrick
  • 69,215
  • 34
  • 177
  • 229
  • 4
    It won't fail fast, but it does work better if you have to compare 1 image to multiple candidates... – EricLaw Jan 08 '10 at 22:38
  • 1
    You could do a hybrid, and test a random sample of say 2% of pixels to see if they're the same, and if so move onto hashing... – Skilldrick Jan 08 '10 at 22:40
  • +1 for mentioning hash. Here is a sample implementation http://www.codeproject.com/KB/GDI-plus/comparingimages.aspx – ram Jan 09 '10 at 00:38
3

If the original problem is just to find the exact duplicates among two bitmaps, then just a bit level comparison will have to do. I don't know C# but in C I would use the following function:

int areEqual (long size, long *a, long *b)
{
    long start = size / 2;
    long i;
    for (i = start; i != size; i++) { if (a[i] != b[i]) return 0 }
    for (i = 0; i != start; i++) { if (a[i] != b[i]) return 0 }
    return 1;
}

I would start looking in the middle because I suspect there is a much better chance of finding unequal bits near the middle of the image than the beginning; of course, this would really depend on the images you are deduping, selecting a random place to start may be best.

If you are trying to find the exact duplicates among hundreds of images then comparing all pairs of them is unnecessary. First compute the MD5 hash of each image and place it in a list of pairs (md5Hash, imageId); then sort the list by the m5Hash. Next, only do pairwise comparisons on the images that have the same md5Hash.

Jeff Kubina
  • 800
  • 4
  • 15
3

If these bitmaps are already on your graphics card then you can parallelize such a check by doing it on the graphics card using a language like CUDA or OpenCL.

I'll explain in terms of CUDA, since that's the one I know. Basically CUDA lets you write general purpose code to run in parallel across each node of your graphics card. You can access bitmaps that are in shared memory. Each invocation of the function is also given an index within the set of parallel runs. So, for a problem like this, you'd just run one of the above comparison functions for some subset of the bitmap - using parallelization to cover the entire bitmap. Then, just write a 1 to a certain memory location if the comparison fails (and write nothing if it succeeds).

If you don't already have the bitmaps on your graphics card, this probably isn't the way to go, since the costs for loading the two bitmaps on your card will easily eclipse the savings such parallelization will gain you.

Here's some (pretty bad) example code (it's been a little while since I programmed CUDA). There's better ways to access bitmaps that are already loaded as textures, but I didn't bother here.

// kernel to run on GPU, once per thread
__global__ void compare_bitmaps(long const * const A, long const * const B, char * const retValue, size_t const len)
{
 // divide the work equally among the threads (each thread is in a block, each block is in a grid)
 size_t const threads_per_block = blockDim.x * blockDim.y * blockDim.z;
 size_t const len_to_compare = len / (gridDim.x * gridDim.y * gridDim.z * threads_per_block);
# define offset3(idx3,dim3)  (idx3.x + dim3.x * (idx3.y + dim3.y * idx3.z))
 size_t const start_offset = len_to_compare * (offset3(threadIdx,blockDim) + threads_per_block * offset3(blockIdx,gridDim));
 size_t const stop_offset = start_offset + len_to_compare;
# undef offset3

 size_t i;
 for (i = start_offset; i < stop_offset; i++)
 {
  if (A[i] != B[i]) 
  {
   *retValue = 1;
   break;
  }
 }
 return;
}
Erik Forbes
  • 35,357
  • 27
  • 98
  • 122
rampion
  • 87,131
  • 49
  • 199
  • 315
0

If you can implement something like Duff's Device in your language, that might give you a significant speed boost over a simple loop. Usually it's used for copying data, but there's no reason it can't be used for comparing data instead.

Or, for that matter, you may just want to use some equivalent to memcmp().

rmeador
  • 25,504
  • 18
  • 62
  • 103
  • 1
    You can do loop unrolling in virtually any language (where it applies). You may not get such pretty and compact syntax as Duff's device, but the performance will be similar. – R. Martinho Fernandes Jan 08 '10 at 22:58
0

You could try to add them to a database "blob" then use the database engine to compare their binaries. This would only give you a yes or no answer to whether the binary data is the same. It would be very easy to make 2 images that produce the same graphic but have different binary though.

You could also select a few random pixels and compare them, then if they are the same continue with more until you've checked all the pixels. This would only return a faster negative match though, it still would take as long to find 100% positive matches

Drew
  • 142
  • 1
  • 12
0

Based on the approach of comparing hashes instead of comparing every single pixel, this is what I use:

public static class Utils
{
    public static byte[] ShaHash(this Image image)
    {
        var bytes = new byte[1];
        bytes = (byte[])(new ImageConverter()).ConvertTo(image, bytes.GetType());

        return (new SHA256Managed()).ComputeHash(bytes);
    }

    public static bool AreEqual(Image imageA, Image imageB)
    {
        if (imageA.Width != imageB.Width) return false;
        if (imageA.Height != imageB.Height) return false;

        var hashA = imageA.ShaHash();
        var hashB = imageB.ShaHash();

        return !hashA
            .Where((nextByte, index) => nextByte != hashB[index])
            .Any();
    }
]

Usage is straight forward:

bool isMatch = Utils.AreEqual(bitmapOne, bitmapTwo);
nathanchere
  • 8,008
  • 15
  • 65
  • 86
  • 1
    Your compare will always return true for the images of the same dimension. "hashB = imageA.ShaHash()" should be "imageB". Typo? – Wbmstrmjb Nov 20 '14 at 22:46