
I am trying to archive TIFF images in a database, and I would like to compress the images as much as possible, even at the cost of higher CPU and memory usage.

In order to test the compression schemes available in LibTiff.NET, I used the following code (modified from this sample):

//getImageRasterBytes and convertSamples are defined in the sample
void Main() {
    foreach (Compression cmp in Enum.GetValues(typeof(Compression))) {
        try {
            using (Bitmap bmp = new Bitmap(@"D:\tifftest\200 COLOR.tif")) {
                using (Tiff tif = Tiff.Open($@"D:\tifftest\output_{cmp}.tif", "w")) {
                    byte[] raster = utils.getImageRasterBytes(bmp, PixelFormat.Format24bppRgb);
                    tif.SetField(TiffTag.IMAGEWIDTH, bmp.Width);
                    tif.SetField(TiffTag.IMAGELENGTH, bmp.Height);
                    tif.SetField(TiffTag.COMPRESSION, cmp);
                    tif.SetField(TiffTag.PHOTOMETRIC, Photometric.RGB);

                    tif.SetField(TiffTag.ROWSPERSTRIP, bmp.Height);

                    tif.SetField(TiffTag.XRESOLUTION, bmp.HorizontalResolution);
                    tif.SetField(TiffTag.YRESOLUTION, bmp.VerticalResolution);

                    tif.SetField(TiffTag.BITSPERSAMPLE, 8);
                    tif.SetField(TiffTag.SAMPLESPERPIXEL, 3);

                    tif.SetField(TiffTag.PLANARCONFIG, PlanarConfig.CONTIG);

                    // bytes per scanline in the raster buffer
                    int stride = raster.Length / bmp.Height;
                    // convertSamples swaps GDI+'s BGR byte order to the RGB order expected by TIFF
                    utils.convertSamples(raster, bmp.Width, bmp.Height);

                    for (int i = 0, offset = 0; i < bmp.Height; i++) {
                        tif.WriteScanline(raster, offset, i, 0);
                        offset += stride;
                    }
                }
            }
        } catch (Exception ex) {
            //code was run in LINQPad
            ex.Dump(cmp.ToString());
        }
    }
}

The test image is 200 dpi, 24 bpp, 1700 pixels wide by 2200 high, and uses LZW compression; the file size is nearly 7 MB. (The image is representative of the images I want to store.)

Of the algorithms that did work (some failed with various errors), the smallest compressed file was created by Compression.Deflate, but that only got down to 5 MB; I would like it significantly smaller (under 1 MB).

There must be some algorithm that compresses further; a PDF file containing this image is something like 500 KB.

If a specific algorithm is incompatible with other TIFF viewers/libraries, that is not an issue, as long as we can extract the compressed TIFF from the database and convert it to a System.Drawing.Bitmap using LibTiff.NET or some other library.
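
For reference, reading a compressed TIFF back into a Bitmap could look roughly like the sketch below, which relies on LibTiff.NET's RGBA interface (ReadRGBAImage and the Tiff.GetR/GetG/GetB/GetA helpers); the file names are illustrative and the code is untested against the files above:

// Rough sketch (untested): decode a TIFF into a System.Drawing.Bitmap
// via LibTiff.NET's RGBA interface.
using (Tiff tif = Tiff.Open(@"D:\tifftest\output_Deflate.tif", "r")) {
    int width = tif.GetField(TiffTag.IMAGEWIDTH)[0].ToInt();
    int height = tif.GetField(TiffTag.IMAGELENGTH)[0].ToInt();
    int[] raster = new int[width * height];
    if (!tif.ReadRGBAImage(width, height, raster))
        throw new InvalidOperationException("Could not decode image");
    using (Bitmap bmp = new Bitmap(width, height, PixelFormat.Format32bppArgb)) {
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                // ReadRGBAImage returns rows bottom-up, one packed ABGR int per pixel
                int abgr = raster[(height - 1 - y) * width + x];
                bmp.SetPixel(x, y, Color.FromArgb(Tiff.GetA(abgr), Tiff.GetR(abgr), Tiff.GetG(abgr), Tiff.GetB(abgr)));
            }
        }
        bmp.Save(@"D:\tifftest\roundtrip.png");
    }
}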

How can I generate even smaller files with lossless compression? Is this even possible with these kinds of images?

Update

  • PDF file
  • TIFF file

  • It sounds like you might want to look at using PNG rather than TIFF format. PNG has lossless compression. – Dijkgraaf Sep 26 '16 at 21:30
  • @Dijkgraaf We will be handling multipage images, which are not supported by PNG. (It's possible we might choose to store each page as a separate record in the database, instead of each file.) – Zev Spitz Sep 26 '16 at 21:38
  • I don't see any image in your post. Show an example, or it's hard to tell what's possible. But it's very unlikely that you can squeeze it down from 5 MB (LZW) to 1 MB. – sascha Sep 26 '16 at 21:47
  • Also: why not split these multipage TIFFs, compress each page alone (candidates: general lossless compression like 7z/lz; other lossless image compressors like webp; or a custom compressor, especially if there is spatial correlation between the pages, as in animations), and rebuild while decompressing? – sascha Sep 26 '16 at 21:58
  • Please post the sample image you're talking about – Simon Mourier Oct 02 '16 at 07:29
  • @SimonMourier I can't post it as it contains private information. I will try to get a similar image that I can post. – Zev Spitz Oct 02 '16 at 08:35
  • @SimonMourier Posted a sample image. – Zev Spitz Oct 09 '16 at 08:29
  • @sascha Posted a sample image. – Zev Spitz Oct 09 '16 at 08:29
  • @ZevSpitz A warning: the content of the pdf and the tiff is not the same (regarding pixels and channels). It's hard to help here if basic image-compression rules are skipped. There are so many cases which can result in different compressions (lossless and lossy). – sascha Oct 09 '16 at 12:07
  • The PDF contains a 200 dpi 1700x2200 24-bit *JPG* file, so just forget about TIFFs and save JPG files in your database. Then you can rebuild TIFFs dynamically if you really want TIFFs. – Simon Mourier Oct 09 '16 at 16:11

3 Answers


Simple evaluation of the test image

Just to give some numbers for the example image (the TIFF one). All of these compressions are lossless, and the results can be converted back to any other lossless format like BMP/PNG (which has been verified).

format              size (bytes)  ratio   license
tiff (original)        5,779,814
png (unoptimized)      3,084,641  53.37%
png (optimized)        2,795,230  48.36%
png (zopfli)           2,791,680  48.30%
jpeg2000               2,230,967  38.60%
webp                   2,021,710  34.98%  BSD
gralic                 1,795,457  31.06%
flif                   1,778,976  30.78%  LGPL3

Remarks

  • These are just the results for one image
    • Most of these still have potential gains, but a huge amount of compression time would be needed
    • While the general observation (the ordering of these compressors by compression efficiency) should hold, the exact values will change for a bigger test set
  • Most of these compressors are designed to handle single images only
    • It would be an easy task to split the multipage TIFF into single images, compress each, and store the connections somehow (see the sketch after this list)
    • This is also very natural within a DB setup
    • If the pages of a multipage TIFF are strongly correlated, it might be possible to exploit this (e.g. with general-purpose compressors, or a custom approach)
  • As I indicated in the comments, the kind of reduction you want is not possible for most types of images (e.g. photos or scans) while sticking to lossless compression
    • There is much that could be said, but the most important point is: these images contain a lot of noise, and noise cannot be compressed
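
A minimal sketch of the splitting step mentioned above, using System.Drawing's frame API (file names are hypothetical; each extracted page can then be handed to any single-image compressor):

using System.Drawing;
using System.Drawing.Imaging;

// Sketch: split a multipage TIFF into single-page files so that each page
// can be compressed on its own.
using (Image multi = Image.FromFile("multipage.tif")) {
    int pages = multi.GetFrameCount(FrameDimension.Page);
    for (int page = 0; page < pages; page++) {
        multi.SelectActiveFrame(FrameDimension.Page, page); // switch to this page
        multi.Save($"page_{page}.png", ImageFormat.Png);    // lossless intermediate
    }
}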

For fun: denoise + lossless-compression

As noise is the most important factor killing lossless-compression potential, let's remove some. We do this with the Python-based code below, but there are many more possible approaches. The code uses a nonlinear filter which tries to remove noise while keeping important edges.

Of course information is lost here, but I actually like the denoised image a bit more, as it's nicer to read (in my opinion).

Code for denoising

from skimage import img_as_ubyte
from skimage.io import imread, imsave
from skimage.restoration import denoise_bilateral

img = imread("200 DPI.tif")
# edge-preserving bilateral filter; newer scikit-image versions rename these
# parameters to sigma_color and channel_axis
img_denoised = denoise_bilateral(img, multichannel=True, sigma_range=0.05, sigma_spatial=15)
# denoise_bilateral returns floats in [0, 1]; convert back to 8-bit before saving
imsave("200 DPI_denoised.png", img_as_ubyte(img_denoised))

Evaluation

flif (denoised)        1,140,497  19.73%

[denoised sample image]


Two parts to the answer:

  • Make it lossy in a way you choose, rather than the way a lossy codec does it. For example, if you are working with scanned text images, do brightness/contrast normalization (possibly local normalization) so the page background is pure white. This improves compressibility by a lot; it could turn a 10 MB grayscale text page with an almost-but-not-exactly-white background into a 200 KB page with a pure white background and grayscale text (using LZW). A sketch of this idea follows the list.

  • Use JPEG2000. If you want the best possible lossless compression, JPEG2000 with lossless settings will likely beat other common algorithms such as PNG, especially for content like photos, but also for scanned pages. Storing JPEG2000 inside a TIFF container should also be possible, but it is not a very common feature of TIFF libraries, so you may or may not want to do that. I think JPEG2000 also has a feature for multiple images in one file.
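
A minimal sketch of the background-whitening idea from the first bullet (the threshold value, helper name, and use of slow GetPixel/SetPixel are illustrative assumptions, not a tuned implementation):

using System.Drawing;

static class Normalize {
    // Sketch: snap near-white pixels to pure white so the page background
    // becomes long runs of identical bytes, which LZW/Deflate compress very well.
    public static Bitmap WhitenBackground(Bitmap src, int threshold = 230) {
        var dst = new Bitmap(src.Width, src.Height);
        for (int y = 0; y < src.Height; y++) {
            for (int x = 0; x < src.Width; x++) {
                Color c = src.GetPixel(x, y);
                bool nearWhite = c.R >= threshold && c.G >= threshold && c.B >= threshold;
                dst.SetPixel(x, y, nearWhite ? Color.White : c);
            }
        }
        return dst;
    }
}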

  • JPEG2000 is really not that good compared to more modern stuff like WebP or FLIF. Of course there are even more research-stage compressors which are even better, e.g. gralic. But maybe the compatibility with TIFF (which you mentioned) is something he wants. – sascha Oct 09 '16 at 10:07

Read up on the G4 compression method: https://en.wikipedia.org/wiki/Group_4_compression

On average, that method gives a 20:1 compression ratio.

Here is a C# example (credits to https://www.experts-exchange.com/viewCodeSnippet.jsp?codeSnippetId=20-41218205-1):

using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Linq;

byte[] imgBits = File.ReadAllBytes(@"multipage_tif.tif");
using (MemoryStream ms = new MemoryStream(imgBits)) {
    using (Image i = Image.FromStream(ms)) {
        // find the built-in GDI+ TIFF encoder
        ImageCodecInfo codec = ImageCodecInfo.GetImageEncoders().FirstOrDefault(encoder => encoder.FormatID == ImageFormat.Tiff.Guid);
        EncoderParameters parms = new EncoderParameters(1);
        // note: CCITT G4 is a bilevel codec, so GDI+ expects a 1bpp black-and-white source image
        parms.Param[0] = new EncoderParameter(Encoder.Compression, (long)EncoderValue.CompressionCCITT4);

        i.Save("out.tif", codec, parms);
    }
}