Is it possible to extract tiff files from PDFs without external libraries?

Question

I was able to use Ned Batchelder's Python code, which I converted to C++, to extract JPEGs from PDF files. I'm wondering whether the same technique can be used to extract TIFF files and, if so, whether anyone knows the appropriate offsets and markers to find them. Thanks, David

score 3 · Answer 1 · answered Aug 13 '11 at 16:14

PDF files can contain several different kinds of image data (not surprisingly).

Most common cases are:

Fax data (CCITT Group 3 and 4)
Raw raster data with decoding parameters and an optional palette, all compressed with Deflate or LZW
JPEG data

Recently, I (as the developer of a PDF library) have started noticing more and more PDFs with JBIG2 image data. JPEG2000 data can also sometimes be found in a PDF.
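As a rough first check of which of these encodings a given file uses, one can count the stream filter names that the PDF format assigns to them. The following is only a sketch: the function name `count_filters` is made up for illustration, and a plain byte search will miss names stored inside compressed object streams or written with `#` escapes.

```python
import re
from collections import Counter

# Standard PDF filter names for the image encodings listed above.
IMAGE_FILTERS = {
    b"CCITTFaxDecode": "fax data (CCITT Group 3/4)",
    b"FlateDecode": "Deflate (raw raster, but also many non-image streams)",
    b"LZWDecode": "LZW (raw raster)",
    b"DCTDecode": "JPEG",
    b"JBIG2Decode": "JBIG2",
    b"JPXDecode": "JPEG2000",
}

def count_filters(pdf_bytes):
    """Count occurrences of each filter name in the raw bytes of a PDF."""
    counts = Counter()
    for name in IMAGE_FILTERS:
        # \b keeps /DCTDecode from matching a longer name with the same prefix.
        counts[name.decode()] = len(re.findall(rb"/" + name + rb"\b", pdf_bytes))
    return counts
```

A high count for DCTDecode suggests marker scanning may be enough, while FlateDecode counts say little on their own because that filter is also used for page content and fonts.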

I should say that you can probably extract JPEG/JBIG2/JPEG2000 data into corresponding files (*.jpeg, *.jb2, *.jp2 / *.jpx) without external libraries, but be prepared for all kinds of weird PDFs emitted by broken generators. Also, PDFs quite often use object streams, so you'll need to implement a fairly sophisticated PDF parser.

Fax data (i.e., what you probably call TIFF) has to be at least wrapped in a valid TIFF container. You can borrow some code for that from the open source libtiff, for example.
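For the simplest case, the wrapping can also be done by hand. Below is a minimal sketch, not libtiff's code, that prepends a little-endian, single-strip TIFF header to raw CCITT Group 4 data. It assumes you have already pulled the stream bytes and the /Width and /Height values out of the image dictionary, and that the stream is Group 4 (a negative /K in the decode parameters). The function name and the `photometric` default are illustrative choices; if the result looks inverted, try 1 instead of 0.

```python
import struct

def wrap_ccitt_g4_in_tiff(data, width, height, photometric=0):
    """Prepend a minimal little-endian TIFF header to raw CCITT Group 4 data."""
    num_tags = 8
    # 8-byte header + 2-byte entry count + 12 bytes per entry + 4-byte next-IFD offset
    data_offset = 8 + 2 + num_tags * 12 + 4
    # (tag, field type, count, value); field types: 3 = SHORT, 4 = LONG.
    # Tags must appear in ascending order.
    tags = [
        (256, 4, 1, width),        # ImageWidth
        (257, 4, 1, height),       # ImageLength
        (258, 3, 1, 1),            # BitsPerSample: 1 bit per pixel
        (259, 3, 1, 4),            # Compression: 4 = CCITT Group 4
        (262, 3, 1, photometric),  # PhotometricInterpretation: 0 = WhiteIsZero
        (273, 4, 1, data_offset),  # StripOffsets: data follows the IFD directly
        (278, 4, 1, height),       # RowsPerStrip: whole image in one strip
        (279, 4, 1, len(data)),    # StripByteCounts
    ]
    out = struct.pack("<2sHI", b"II", 42, 8)  # byte order, magic 42, IFD offset
    out += struct.pack("<H", len(tags))
    for tag, field_type, count, value in tags:
        # In little-endian order a SHORT value occupies the first two bytes of
        # the 4-byte value field, so packing it as a 32-bit integer is correct.
        out += struct.pack("<HHII", tag, field_type, count, value)
    out += struct.pack("<I", 0)  # no further IFDs
    return out + data
```

Writing the returned bytes to a file with a .tif extension should give something a TIFF viewer can open. Group 3 streams, multi-filter streams, and streams with unusual decode parameters need more tags or extra processing and are not covered by this sketch.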

And then comes raw raster data. I don't think it makes sense to try to extract such data without the help of a library. You could do that, of course, but it will take months of work.

So, if you are trying to extract only a specific kind of image data from a set of PDFs all created with the same generator, then your task is probably feasible. In all other cases I would recommend saving time, money and hair by using a library for the task.

I would much rather use a library but I have not been able to find a reasonably priced royalty free library that works on both Windows and Mac. — David, Aug 13 '11 at 16:37

score 1 · Answer 2 · answered Aug 13 '11 at 16:06

PDF files store JPEGs as actual JPEG data (DCT and JPX encoding), so in most cases you can rip the data out. With TIFFs, you are looking for CCITT data (but you will need to add a header to the data to make it a TIFF). I wrote two blog articles on images in PDF files at http://www.jpedal.org/PDFblog/2010/09/understanding-the-pdf-file-format-images/ and http://www.jpedal.org/PDFblog/2011/07/extract-raw-jpeg-images-from-a-pdf-file/ which might help.
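For reference, "ripping the data out" for JPEGs can be sketched as a scan for the JPEG start and end markers inside PDF streams. This is a simplified illustration in the spirit of the marker-scanning approach mentioned in the question, not a reproduction of anyone's published code. The function name is made up, and the scan assumes the JPEG is the outermost and only filter on the stream and that the bytes "endstream" do not occur inside the image data.

```python
def extract_jpegs(pdf):
    """Return byte strings that look like JPEG streams embedded in a PDF."""
    soi, eoi = b"\xff\xd8", b"\xff\xd9"  # JPEG start-of-image / end-of-image
    images = []
    pos = 0
    while True:
        s = pdf.find(b"stream", pos)
        if s < 0:
            break
        # A JPEG stream has SOI right after the "stream" keyword and its line ending.
        start = pdf.find(soi, s, s + 20)
        if start < 0:
            pos = s + len(b"stream")
            continue
        end_kw = pdf.find(b"endstream", start)
        if end_kw < 0:
            break
        # The last EOI marker before "endstream" closes the JPEG.
        end = pdf.rfind(eoi, start, end_kw)
        if end >= 0:
            images.append(pdf[start:end + 2])
        pos = end_kw + len(b"endstream")
    return images
```

Each returned byte string can be written straight to a .jpg file. CCITT data has no such self-describing markers, which is why it has to be located through the stream dictionary (/Filter /CCITTFaxDecode) and then given a TIFF header.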

mark stephens

Thanks, Mark. I did see those articles. I was just hoping that somebody had already done the hard work with a simple example like the one I had found for JPEGs in Python. – David Aug 13 '11 at 16:35
