After some research I came up with a solution which is fast and clean per my tests.
The overall solution is this:
- Create a container for your image via
docker create ...
- Export its entire file system to a tar archive via
docker export ...
- Pipe the archive directory names, symlink names, symlink contents, file names, and file contents, to an hash function (e.g., MD5)
- Compare the hashes of different images to verify if their contents are equal or not
And that's it.
Technically, this can be done as follows:
1) Create file md5docker
, and give it execution rights, e.g., chmod +x md5docker
:
#!/bin/sh
dir=$(dirname "$0")
docker create $1 | { read cid; docker export $cid | $dir/tarcat | md5; docker rm $cid > /dev/null; }
2) Create file tarcat
, and give it execution rights, e.g., chmod +x tarcat
:
#!/usr/bin/env python3
# coding=utf-8
if __name__ == '__main__':
import sys
import tarfile
with tarfile.open(fileobj=sys.stdin.buffer, mode="r|*") as tar:
for tarinfo in tar:
if tarinfo.isfile():
print(tarinfo.name, flush=True)
with tar.extractfile(tarinfo) as file:
sys.stdout.buffer.write(file.read())
elif tarinfo.isdir():
print(tarinfo.name, flush=True)
elif tarinfo.issym() or tarinfo.islnk():
print(tarinfo.name, flush=True)
print(tarinfo.linkname, flush=True)
else:
print("\33[0;31mIGNORING:\33[0m ", tarinfo.name, file=sys.stderr)
3) Now invoke ./md5docker <image>
, where <image>
is your image name or id, to compute an MD5 hash of the entire file system of your image.
To verify if two images have the same contents just check that their hashes are equal as computed in step 3).
Note that this solution only considers as content directory structure, regular file contents, and symlinks (soft and hard). If you need more just change the tarcat
script by adding more elif
clauses testing for the content you wish to include (see Python's tarfile, and look for methods TarInfo.isXXX()
corresponding to the needed content).
The only limitation I see in this solution is its dependency on Python (I am using Python3, but it should be very easy to adapt to Python2). A better solution without any dependency, and probably faster (hey, this is already very fast), is to write the tarcat
script in a language supporting static linking so that a standalone executable file was enough (i.e., one not requiring any external dependencies, but the sole OS). I leave this as a future exercise in C, Rust, OCaml, Haskell, you choose.
Note, if MD5 does not suit your needs, just replace md5
inside the first script with your hash utility.
Hope this helps anyone reading.