Loading and saving a JPEG image results in different file content

Question

Here's a script that reads a JPG image and then writes 2 JPG images:

import cv2

# https://github.com/opencv/opencv/blob/master/samples/data/baboon.jpg 
input_path = './baboon.jpg'

# Read image
im = cv2.imread(input_path)

# Write image using default quality (95)
cv2.imwrite('./baboon_out.jpg', im)

# Write image using best quality
cv2.imwrite('./baboon_out_100.jpg', im, [cv2.IMWRITE_JPEG_QUALITY, 100])

After running the above script, here's what the files look like:

.
├── baboon.jpg
├── baboon_out.jpg
├── baboon_out_100.jpg
└── main.py

However, the MD5 checksums of the JPGs created by the script do not match the original:

>>> md5 ./baboon.jpg 
MD5 (./baboon.jpg) = 9a7171af1d6c6f0901d36d04e1bd68ad
>>> md5 ./baboon_out.jpg 
MD5 (./baboon_out.jpg) = 1014782b9e228e848bc63bfba3fb49d9
>>> md5 ./baboon_out_100.jpg 
MD5 (./baboon_out_100.jpg) = dbadd2fadad900e289e285393778ad89

Is there any way to preserve the original image content with OpenCV? In which step is the data being modified?

The original image content is `baboon.jpg`. Writing new JPG files with an arbitrary encoder and arbitrary settings isn't even a little bit guaranteed to produce the same output as the original; that's the entire problem of generation loss in a lossy codec. — TigerhawkT3, Nov 14 '22 at 05:04
_Is there any way to preserve the original image content_ Sure. `cp baboon.jpg baboon2.jpg` — John Gordon, Nov 14 '22 at 05:06
Ok so the encode step is where data is changed -- [even with quality 100, JPG compression is lossy](https://stackoverflow.com/a/48578831/10720618). So what I need is to use a lossless codec like PNG — kym, Nov 14 '22 at 05:22
No, that is still incorrect. Reading a lossless PNG either with OpenCV or PIL and writing it again immediately afterwards without changing the pixels in any way will almost certainly result in a different hash. — Mark Setchell, Nov 14 '22 at 07:20
Reading and writing a PNG will preserve the pixel values, so the RGB image decoded from the written PNG file will be the same as the one decoded from the original PNG file. The hashes might still be different because of different encoder settings (compression level etc.) and possibly different metadata. However, if you write a PNG A with OpenCV, then read PNG A with OpenCV and write it again as PNG B, then A and B will likely have the same hash. — Micka, Nov 14 '22 at 08:23
@Micka my use case is to use the hash value as a cache key so that I can avoid excess API calls for the same image. So "RGB image that is generated from the written png file will be the same as from the read png file" seems like it'll work for my use case as long as I read in the image and hash the RGB values instead of the file directly. Does that sound correct? — kym, Nov 14 '22 at 15:22
@kym yes, that should work, as long as the memory alignment is the same. On some machines, padding could be present at the end of each image row. If you want to mix different APIs like PIL and OpenCV, also make sure that the channel order is the same (RGB vs. BGR). — Micka, Nov 14 '22 at 15:48
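
The pixel-hashing idea from the comments above could be sketched roughly as follows. This is only an illustration: `pixel_cache_key` is a made-up name, and it assumes the decoded image is a NumPy array such as the one `cv2.imread` returns (only `.shape`, `.dtype` and `.tobytes()` are used).

```python
import hashlib


def pixel_cache_key(pixels) -> str:
    """Hash decoded pixel data instead of the encoded file bytes.

    `pixels` is assumed to be a NumPy array (for example from cv2.imread).
    """
    h = hashlib.sha256()
    # Include shape and dtype so that, say, a 2x3 and a 3x2 image holding
    # the same bytes do not end up with the same key.
    h.update(f"{pixels.shape}|{pixels.dtype}|".encode())
    # ndarray.tobytes() returns a packed C-order copy, so row padding or
    # non-contiguous views do not change the result.
    h.update(pixels.tobytes())
    return h.hexdigest()


# Hypothetical usage (requires OpenCV):
#   import cv2
#   key = pixel_cache_key(cv2.imread("./baboon.jpg"))
```

Two caveats apply. Arrays from different libraries need the same channel order (BGR vs. RGB) before hashing, and for JPEG input the decoded pixels themselves may differ slightly between decoder implementations, so the key is most dependable when the same library and version decode every image.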

ti7 · Answer 1 · 2022-11-14T06:05:53.003

It's a simplification, but the checksum will almost always change when you read and re-save a JPEG image: it's a lossy format being re-encoded, and the result is fed to a hash with the avalanche property, so even a one-bit difference in the file produces a completely different digest. It definitely will change if you do any meaningful work on the image, even if (especially if) you directly manipulate the bits of the file rather than loading it as an image.

If you want the hash to be the same, just copy the file!

If you're looking for pixel-perfect manipulation, try a lossless format like PNG, and be clear about whether you are comparing the pixels or meta attributes à la Exif (e.g. should the Google, GitHub (Microsoft), and Twitter-affiliated fields from your baboon.jpg be included in your comparison? What about color data or depth information?)

$ exiftool baboon.jpg | grep Description
Description                     : Open Source Computer Vision Library. Contribute to opencv/opencv development by creating an account on GitHub.
Twitter Description             : Open Source Computer Vision Library. Contribute to opencv/opencv development by creating an account on GitHub.

Finally, watch out specifically when using MD5, as it's widely considered cryptographically broken (practical collision attacks exist). A more modern hash like SHA-256 might help avoid unforeseen issues.
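
As a rough sketch of the SHA-256 suggestion, a file's raw bytes can be hashed with the standard library's `hashlib`. The helper name `file_sha256` and the chunk size are illustrative choices, not anything from OpenCV.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large images do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Like MD5, this hashes the encoded file, so a re-encoded JPEG will still get a different digest; switching the hash function only addresses the collision weakness.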

Mark Setchell · Answer 2 · 2022-11-14T08:31:55.083

There's a fundamental misconception here about the way images are stored and what lossy/lossless means.

Let's imagine we have a really simple 1-pixel grey image and that pixel has brightness 100. Let's make our own format for storing our image... I'm going to choose to store it as a Python script. Ok, here's our file:

print(100)

Now, let's do another lossless version, like PNG might:

print(10*10)

Now we have recreated our image losslessly, because we still get 100, but the MD5 hash is clearly going to be different. PNG does filtering before compressing and it checks whether each pixel is similar to the ones above and to the left in order to save space. One encoder might decide it can do better by looking at the pixel above, another one might think it can do better by looking at the pixel to the left. The result is the same.
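
The same effect can be reproduced with zlib, the DEFLATE implementation that PNG's compression is based on. This is a sketch rather than a real PNG encoder: the "image" is just a run of bytes, and compression levels 1 and 9 stand in for two encoders making different choices.

```python
import hashlib
import zlib

# Stand-in for raw pixel data: a 64x64 grey image, every pixel = 100.
pixels = bytes([100]) * (64 * 64)

fast = zlib.compress(pixels, 1)  # quick, less thorough search
best = zlib.compress(pixels, 9)  # slower, more thorough search

# Both streams decode back to exactly the same pixels ...
assert zlib.decompress(fast) == pixels
assert zlib.decompress(best) == pixels

# ... but the encoded bytes, and therefore the hashes, differ.
print(hashlib.md5(fast).hexdigest())
print(hashlib.md5(best).hexdigest())
print(fast == best)
```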

Now let's make a new version of the file that stores the image:

# Updated 2022-08-12T21:15:46.544Z
print(100)

Now there's a new comment. The metadata about the image has changed, and so has the MD5 hash. But no pixel data has changed, so it is a lossless change. PNG images often store the date/time in the file like this, so two images with pixel-for-pixel identical content can have different hashes if they were saved 1 second apart.
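
To make the toy format concrete, the three "files" so far can be decoded and hashed side by side. This is purely illustrative: `decode` is a made-up helper that runs each script and reads the pixel value it prints.

```python
import contextlib
import hashlib
import io

# The three toy "image files" from above, each a tiny Python script.
files = {
    "plain": "print(100)\n",
    "filtered": "print(10*10)\n",
    "with_metadata": "# Updated 2022-08-12T21:15:46.544Z\nprint(100)\n",
}


def decode(source: str) -> int:
    """'Decode' a toy image by running it and reading the printed pixel."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(source, {})
    return int(buf.getvalue())


for name, source in files.items():
    digest = hashlib.md5(source.encode()).hexdigest()
    print(f"{name:14} pixel={decode(source)} md5={digest}")
```

Every file decodes to pixel 100, yet each one has a different MD5.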

Now let's make our single pixel image like a JPEG:

print(99)

According to JPEG, that'll look pretty similar, and it's 1 byte shorter, so it's fine. How do you think the MD5 hash will compare?

Now there's a different JPEG library, it thinks this is pretty close too:

print(98)

Again, the image will appear indistinguishable from the original, but your MD5 hash will be different.
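
Where the 99 and 98 could come from can be sketched with plain quantization, i.e. rounding to the nearest multiple of a step size. This is a loose analogy only: real JPEG quantizes DCT coefficients of 8x8 blocks rather than single pixel values, and the steps 3 and 7 are arbitrary numbers picked so the results match the example.

```python
def lossy_roundtrip(value: int, step: int) -> int:
    """Quantize to a multiple of `step`, as a lossy encoder might, then decode."""
    return round(value / step) * step


original = 100
encoder_a = lossy_roundtrip(original, step=3)  # 99
encoder_b = lossy_roundtrip(original, step=7)  # 98
print(original, encoder_a, encoder_b)

# Feeding encoder A's output into encoder B loses a little more each time.
print(lossy_roundtrip(encoder_a, step=7))
```

Neither result equals the original 100, and which value you get depends on the encoder's settings, which is why two JPEG writers rarely agree byte for byte.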

TL;DR

If you want an identical, bitwise perfect copy of your file, use the copy command.

Further food for thought... when you read an image with OpenCV, you actually get a Numpy array of pixels, don't you? Where is the comment that was in your image? The copyright? The GPS location? The camera make and model and shutter speed? The ICC colour profile? They are not in your Numpy array of pixels, are they? They are lost. So, even if the PNG/JPEG encoder made exactly the same decisions as the original encoder, your MD5 hash would still be different because all that other data is missing.
