Extract TIFF images from PDF without decoding

Question

With the help of iText 5, I would like to extract all TIFF images from a given PDF file and save them as TIFF files. Examples and other posts (1, 2) use the following method:

1. Create a PdfImageObject from the PDF stream, which in line 189 decodes the image stream (if the corresponding filter implementation is present).
2. Call PdfImageObject#getImageAsBytes(), which returns JPEG (original), PNG (re-encoded) or TIFF (in case of 8 bits per pixel).

As a result, a TIFF image with 1-bit color depth is converted to PNG, which is not what I need.

Another approach would be to call PdfImageObject#getBufferedImage(), which decodes the image in step (2) into a raster that is afterwards encoded again as TIFF using ImageIO.write(bufferedImage, "tiff", file).

As one can see, this is not efficient. Another solution, shown in this post, demonstrates how to save the encoded TIFF image stream to a file by prepending a TIFF header to it; that is the solution I am looking for.

Can iText help here?

K J · Answer 1 · 2023-04-23T11:31:43.343

By far the simplest approach is to shell out to the OS and use pdfimages (see the Debian manual) from any recent poppler-utils package.

For Windows, builds are available at https://github.com/oschwartz10612/poppler-windows
Other versions may have different output abilities.

poppler/bin> pdfimages -tiff in.pdf out

This will (or should) extract all images as colour or mono .tif files, using names such as out-000.tif. Note, however, that it is normal for single-colour masks to look reversed, as that is often how they are stored in a PDF.

The mono TIFF will be extracted as requested, but at a different density: a source at a nominal 300 dpi resolution on paper will be exported as a lossless, uncompressed mono TIFF with the PDF nominal density of 72 dpi. The number of pixels is exactly correct, but the image will appear larger in scale and may seem to have different colours.
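Since the question is about Java, shelling out as described above could be sketched roughly as follows. This is only an illustration: the class and method names (`PdfImagesRunner`, `command`, `run`) are made up for the example, and it assumes a `pdfimages` binary with `-tiff` support is on the PATH.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: runs "pdfimages -tiff <pdf> <prefix>" as an external process.
final class PdfImagesRunner {

    // Builds the same command line as shown above: pdfimages -tiff in.pdf out
    static List<String> command(String pdfPath, String outputPrefix) {
        return Arrays.asList("pdfimages", "-tiff", pdfPath, outputPrefix);
    }

    // Starts the process, forwards its output to the console and returns its exit code.
    // Assumption: pdfimages is installed and reachable via PATH.
    static int run(String pdfPath, String outputPrefix)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(pdfPath, outputPrefix))
                .inheritIO()
                .start();
        return process.waitFor();
    }
}
```

Calling `PdfImagesRunner.run("in.pdf", "out")` should then leave files such as out-000.tif next to the working directory, provided the installed poppler build supports TIFF output.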

Thanks for the hint! Good utility, but according to the man page it supports neither TIFF nor PNG output formats, only PPM and PBM. It is not clear why those two are popular on Linux when they are not supported by web browsers. Also, the question was mostly about the iText API. I have created a Java utility, [merge2pdf](https://github.com/dmak/merge2pdf), that extracts images using that library, but it does not support extraction to TIFF (only JPG/PNG), so that would need an additional conversion step. – dma_k, Apr 23 '23 at 10:17

score 0 · Answer 2 · answered Jul 22 '19 at 01:19

PDF images are not TIFF images.

PDFs however can contain images that use compression techniques that are also used in TIFF, e.g. Flate, CCITT, LZW, JPEG.
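Because the compression is shared, an encoded stream can in principle be wrapped in a TIFF container without decoding it. Below is a minimal sketch for the 1-bit case, under several assumptions: the stream uses CCITTFaxDecode with K < 0 (Group 4), EncodedByteAlign is false, and width and height come from the Columns and Rows entries of the image dictionary. The names `CcittTiffWrapper` and `wrap` are hypothetical and not part of iText. Resolution tags are omitted for brevity, and whether the photometric value must be flipped for a given BlackIs1/Decode combination should be checked against real files.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: prepends a single-strip, little-endian TIFF header
// to raw CCITT Group 4 data without touching the encoded bytes.
final class CcittTiffWrapper {
    private static final short TYPE_SHORT = 3; // TIFF field type SHORT (16 bit)
    private static final short TYPE_LONG = 4;  // TIFF field type LONG (32 bit)

    // photometric: 0 = WhiteIsZero (usual for fax data), 1 = BlackIsZero
    static byte[] wrap(byte[] encoded, int width, int height, int photometric) {
        final int entryCount = 9;
        final int ifdOffset = 8; // the IFD follows the 8-byte header directly
        // IFD = 2-byte entry count + 12 bytes per entry + 4-byte next-IFD offset
        final int dataOffset = ifdOffset + 2 + entryCount * 12 + 4;

        ByteBuffer buf = ByteBuffer.allocate(dataOffset + encoded.length)
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.put((byte) 'I').put((byte) 'I'); // "II" = little-endian byte order
        buf.putShort((short) 42);            // TIFF magic number
        buf.putInt(ifdOffset);

        buf.putShort((short) entryCount);    // entries must be in ascending tag order
        putLong(buf, 256, width);            // ImageWidth
        putLong(buf, 257, height);           // ImageLength
        putShort(buf, 258, 1);               // BitsPerSample
        putShort(buf, 259, 4);               // Compression: 4 = CCITT T.6 (Group 4)
        putShort(buf, 262, photometric);     // PhotometricInterpretation
        putLong(buf, 273, dataOffset);       // StripOffsets: one strip
        putShort(buf, 277, 1);               // SamplesPerPixel
        putLong(buf, 278, height);           // RowsPerStrip: whole image
        putLong(buf, 279, encoded.length);   // StripByteCounts
        buf.putInt(0);                       // no further IFDs

        buf.put(encoded);                    // encoded data copied as-is
        return buf.array();
    }

    private static void putShort(ByteBuffer buf, int tag, int value) {
        buf.putShort((short) tag).putShort(TYPE_SHORT).putInt(1);
        buf.putShort((short) value).putShort((short) 0); // value is left-justified
    }

    private static void putLong(ByteBuffer buf, int tag, int value) {
        buf.putShort((short) tag).putShort(TYPE_LONG).putInt(1).putInt(value);
    }
}
```

The returned array can be written to a .tif file directly. Flate, LZW and JPEG streams would need different Compression values and extra care (for example, PDF predictors and colour spaces do not map one-to-one onto TIFF tags), so this sketch covers only the Group 4 case.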

JosephA

Thanks for the information; however, your post does not provide an answer. You could add it as a comment on the question. Indeed, TIFF may support only some compression methods, but it does support Deflate, LZW, CCITT and JPEG; see [wikipedia](https://en.wikipedia.org/wiki/TIFF#TIFF_Compression_Tag). – dma_k Jul 28 '19 at 22:42
