After doing some research and running a bunch of tests, here is my solution to my own question.
First, I want to make clear that we are not talking about a forensic investigation. There are possibly ways to manipulate a JPG image so that markers appear where they shouldn't and are missing where the specs require them.
We are not talking about image identity or similarity, either. If you losslessly rotate a JPG you still have the very same image information, but not the identical image any more. We're not talking, either, about images that have been resized, optimized or altered in any other way.
What we are talking about is identifying simple duplicates or JPGs that have been renamed or where metadata has been modified or removed, but where the image itself has never been processed or tampered with in any way.
Is a hash of the bytes between the SOS and the EOI markers a reliable way to uniquely identify an image?
Yes, it is. Within bounds of reason there is no way two files with identical MD5 checksums of the image scan data can contain non-identical images and vice versa.
I examined sample photos taken with cameras from 12 different makers and edited or stripped their metadata: identical MD5 checksums all over the place. Strictly speaking this test wasn't really necessary, because from the specs and the code you know that all metadata resides in separate segments (that's why you can hide all kinds of stuff in a JPG) and metadata operations never touch the scan data.
Is there any way to quickly locate the (right) SOS marker?
Definitely. The JPG specs are a mess and a punishment. After trying quite a few pieces of code I found NativeJPG by Nils Haeck to be the most straightforward.
This has been adapted from sdJpegImage:
function FindSOSPos(S: TStream): Cardinal;
// Returns the stream position just after the first SOS marker ($FF $DA),
// or 0 if the stream ends before an SOS marker is found.
var
  B, MarkerTag: Byte;
  BytesRead: Integer;
  Size, W: Word;
const
  mkNone = 0; mkSOF0 = $c0; mkSOF1 = $c1; mkSOF2 = $c2; mkSOF3 = $c3; mkSOF5 = $c5;
  mkSOF6 = $c6; mkSOF7 = $c7; mkSOF9 = $c9; mkSOF10 = $ca; mkSOF11 = $cb; mkSOF13 = $cd;
  mkSOF14 = $ce; mkSOF15 = $cf; mkDHT = $c4; mkDAC = $cc; mkSOI = $d8; mkEOI = $d9; mkSOS = $da;
  mkDQT = $db; mkDNL = $dc; mkDRI = $dd; mkDHP = $de; mkEXP = $df; mkAPP0 = $e0; mkAPP15 = $ef; mkCOM = $fe;
begin
  Result := 0;
  repeat
    // Read the next byte; give up if the stream is exhausted
    if S.Read(B, 1) = 0 then
      Exit;
    // Do we have a marker?
    if B = $FF then
    begin
      // Skip any $FF fill bytes in front of the marker code
      MarkerTag := mkNone;
      BytesRead := S.Read(MarkerTag, 1);
      while (BytesRead > 0) and (MarkerTag = $FF) do
      begin
        MarkerTag := mkNone;
        BytesRead := S.Read(MarkerTag, 1);
      end;
      // End of stream (or a $FF $00 pair) before any SOS: report "not found"
      if MarkerTag = mkNone then
        Exit;
      if MarkerTag = mkSOS then
      begin
        Result := S.Position;
        Exit;
      end;
      Size := 0;
      if MarkerTag in [mkAPP0..mkAPP15, mkDHT, mkDQT, mkDRI,
        mkSOF0, mkSOF1, mkSOF2, mkSOF3, mkSOF5, mkSOF6, mkSOF7, mkSOF9,
        mkSOF10, mkSOF11, mkSOF13, mkSOF14, mkSOF15, mkCOM, mkDNL] then
      begin
        // Read the big-endian segment length (it counts its own two bytes)
        if S.Read(W, 2) = 2 then
          Size := Swap(W) - 2
        else
          Exit;
      end;
      // Skip the segment payload, including any thumbnail embedded in it
      S.Position := S.Position + Size;
    end
    else
    begin
      // B <> $FF is an error, we try to be flexible
      // and resynchronise on the next $FF byte
      repeat
        BytesRead := S.Read(B, 1);
      until (BytesRead = 0) or (B = $FF);
      if BytesRead = 0 then
        Exit;
      S.Seek(-1, soFromCurrent);
    end;
  until False;
end;
Omit the first 6 Bytes after the SOS marker?
I decided to hash everything between SOS and EOI excluding the markers themselves.
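To make the claim about metadata concrete, here is a minimal sketch in Python (not Delphi, purely so it can be run anywhere). It walks the markers in roughly the same way as FindSOSPos, hashes the bytes between the SOS and EOI markers, and compares three synthetic files that differ only in their metadata. The helper names (`find_sos_pos`, `scan_md5`, `segment`) and the hand-built byte strings are assumptions for illustration, not decodable images, and treating the first `FF D9` after SOS as the EOI is a simplification.

```python
import hashlib
import struct

# Markers that are followed by a two-byte length, mirroring the set in FindSOSPos.
SOF_MARKERS = {0xC0, 0xC1, 0xC2, 0xC3, 0xC5, 0xC6, 0xC7,
               0xC9, 0xCA, 0xCB, 0xCD, 0xCE, 0xCF}
LENGTH_MARKERS = set(range(0xE0, 0xF0)) | SOF_MARKERS | {0xC4, 0xDB, 0xDC, 0xDD, 0xFE}
SOS, EOI = 0xDA, 0xD9


def find_sos_pos(data):
    """Return the offset just after the first top-level SOS marker, or None."""
    pos = 0
    while pos + 1 < len(data):
        if data[pos] != 0xFF:        # not a marker: resynchronise leniently
            pos += 1
            continue
        marker = data[pos + 1]
        if marker == 0xFF:           # fill byte in front of the marker code
            pos += 1
            continue
        pos += 2
        if marker == SOS:
            return pos
        if marker in LENGTH_MARKERS:
            if pos + 2 > len(data):
                return None
            (length,) = struct.unpack(">H", data[pos:pos + 2])
            pos += length            # the length counts its own two bytes
    return None


def scan_md5(data):
    """MD5 of everything between the SOS and EOI markers, markers excluded."""
    start = find_sos_pos(data)
    if start is None:
        raise ValueError("no SOS marker found")
    # Simplification: take the first FF D9 after SOS as the EOI marker.
    end = data.find(bytes([0xFF, EOI]), start)
    if end == -1:
        raise ValueError("no EOI marker found")
    return hashlib.md5(data[start:end]).hexdigest()


def segment(marker, payload):
    """Build a marker segment: FF, marker code, big-endian length, payload."""
    return bytes([0xFF, marker]) + struct.pack(">H", len(payload) + 2) + payload


# Synthetic, JPEG-like byte strings (hypothetical test data).
soi = b"\xff\xd8"
tables = segment(0xDB, bytes(65)) + segment(0xC0, bytes(15)) + segment(0xC4, bytes(29))
scan = (b"\xff\xda" + b"\x00\x08\x01\x01\x00\x00\x3f\x00"   # SOS marker + header
        + b"\x12\x34\xff\x00\x56\x78" + b"\xff\xd9")        # entropy data + EOI
# A fake thumbnail with its own SOS/EOI, hidden inside the APP1 payload.
thumbnail = b"\xff\xd8\xff\xda\x00\x02THUMB\xff\xd9"

with_exif = soi + segment(0xE1, b"Exif\x00\x00" + thumbnail) + tables + scan
stripped = soi + tables + scan
retagged = soi + segment(0xE1, b"Exif\x00\x00other") + segment(0xFE, b"renamed") + tables + scan

for name, blob in [("with_exif", with_exif), ("stripped", stripped), ("retagged", retagged)]:
    print(name, hashlib.md5(blob).hexdigest(), scan_md5(blob))
```

The whole-file hashes differ for all three variants while the scan data hash stays the same. Because the APP1 segment is skipped by its length, the SOS marker of the embedded fake thumbnail is never mistaken for the real one.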
Is there a fast way to locate the trailing EOI marker?
No. But this is irrelevant, since to compute a hash you have to read every single byte anyway.
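Since every byte has to be read anyway, the search for EOI can be folded into the hashing pass. The following is a sketch in Python (not Delphi) under two assumptions: the stream is already positioned just after the SOS marker, and the first `FF D9` encountered is the EOI. The second holds for the entropy-coded data of a single scan, where every `FF` data byte is stuffed with a following `00`, but it is not guaranteed for the SOS header or for the segments between the scans of a multi-scan file. The function name `hash_until_eoi` is made up for this example.

```python
import hashlib
import io


def hash_until_eoi(stream, chunk_size=65536):
    """Hash bytes from the current stream position up to (not including) FF D9.

    Returns (hex_digest, eoi_found). If no EOI is seen, everything that was
    read is hashed and eoi_found is False.
    """
    md5 = hashlib.md5()
    tail = b""  # a trailing FF held back in case the next chunk starts with D9
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            md5.update(tail)
            return md5.hexdigest(), False
        buf = tail + chunk
        idx = buf.find(b"\xff\xd9")
        if idx != -1:
            md5.update(buf[:idx])
            return md5.hexdigest(), True
        if buf[-1] == 0xFF:
            # The marker may straddle the chunk boundary: keep the FF for later.
            md5.update(buf[:-1])
            tail = b"\xff"
        else:
            md5.update(buf)
            tail = b""


# Hypothetical scan data, followed by EOI and some trailing junk.
payload = b"ab\xff\x00cd\xff\xff\x00"
stream_bytes = payload + b"\xff\xd9" + b"trailing junk"

for size in (1, 2, 3, 65536):
    digest, found = hash_until_eoi(io.BytesIO(stream_bytes), chunk_size=size)
    print(size, found, digest == hashlib.md5(payload).hexdigest())
```

The chunk sizes in the loop are deliberately tiny so that the `FF D9` pair ends up split across a chunk boundary; the digest is the same in every case.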
How reliable is this approach?
As I said, I believe that within the bounds of reason the chance of a false positive is practically zero. As for locating the right image: NativeJPG has been around for more than 10 years and you find very few complaints; those that exist deal with decoding the image, not with failing to find it.
In my application I offer the option to store the original filename, the EXIF DateTimeDigitized, the camera make, the GPS coordinates and MD5 hashes of the scan data (full and first 16 kB) in the UserComment field. I'm pretty confident that this will make it possible to identify the file later under most conditions (provided the UserComment has remained intact).
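As a rough illustration of such a record, here is a sketch in Python (not Delphi). The field names, the JSON encoding and the sample values are all made up for this example and are not the format my application writes; reading and writing the actual UserComment tag is left out. Checking the 16 kB hash first as a cheap pre-filter is one possible use of the shorter hash.

```python
import hashlib
import json

PREFIX_LEN = 16 * 1024  # "first 16 kB" of the scan data


def fingerprint_record(scan_data, original_name, date_time_digitized, make, gps):
    """Build a dict that could be serialised into a comment field."""
    return {
        "name": original_name,
        "digitized": date_time_digitized,
        "make": make,
        "gps": gps,
        "md5_full": hashlib.md5(scan_data).hexdigest(),
        "md5_16k": hashlib.md5(scan_data[:PREFIX_LEN]).hexdigest(),
    }


def matches(record, scan_data):
    """Compare scan data against a stored record, cheap prefix check first."""
    if hashlib.md5(scan_data[:PREFIX_LEN]).hexdigest() != record["md5_16k"]:
        return False
    return hashlib.md5(scan_data).hexdigest() == record["md5_full"]


# Hypothetical scan data and metadata values.
scan_data = bytes(range(256)) * 100            # 25600 bytes, more than 16 kB
record = fingerprint_record(scan_data, "IMG_0001.JPG", "2015:06:01 12:00:00",
                            "ExampleCam", [48.0, 11.0])
comment_text = json.dumps(record)              # text to store in the comment

restored = json.loads(comment_text)
print(matches(restored, scan_data))                   # same scan data
print(matches(restored, scan_data[:-1] + b"\x00"))    # last byte altered
```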