Strictly speaking (Is it possible to use libtiff...?), yes. It involves some hacking, but not too much.
Fact: the data will consist of a single strip, since there is no offset information and the only offset is therefore zero. We just need to read that strip in.
Fact: this data is the compressed form of a W*H pixel matrix, 1 bit deep.
Step 1: estimate the maximum possible length of the compressed stream. This comes out at around 15% of W*H, e.g. with W=1000 and H=1000 you get 150000 bytes. The estimate is deliberately generous (the raw bitmap alone takes W*H/8 bytes, i.e. 12.5% of W*H), so it should always exceed the actual length. If we have a better estimate thanks to having located the proper EI end-of-image operator, so much the better, but it is not necessary.
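A minimal sketch of that estimate in C. The function name is mine, and the 15% factor is simply the rule of thumb from Step 1, not something libtiff provides:

```c
#include <stddef.h>

/* Generous guess for the compressed stream length: 15% of W*H, rounded up.
   No overflow check: fine for page-sized images, not for arbitrary input. */
size_t estimate_ccitt_length(size_t width, size_t height)
{
    size_t pixels = width * height;
    return (pixels * 15 + 99) / 100;
}
```

With W=1000 and H=1000 this gives the 150000 bytes quoted above.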
Step 2: build a "virtual" TIFF file. This will be made up of a header of the form 49 49 2a 00 AA BB CC DD, where 0xDDCCBBAA is the estimated length plus 8; followed by our estimated data stream; followed by a TIFF directory.
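The header from Step 2 could be written out like this. It is only a sketch, and the function name is an assumption of mine:

```c
#include <stdint.h>

/* Write the 8-byte little-endian ("II") TIFF header into out[0..7].
   ifd_offset is where the directory will start: estimated length + 8. */
void write_tiff_header(uint8_t out[8], uint32_t ifd_offset)
{
    out[0] = 0x49; out[1] = 0x49;                    /* "II": Intel byte order */
    out[2] = 0x2a; out[3] = 0x00;                    /* magic number 42 */
    out[4] = (uint8_t)(ifd_offset & 0xff);           /* AA */
    out[5] = (uint8_t)((ifd_offset >> 8) & 0xff);    /* BB */
    out[6] = (uint8_t)((ifd_offset >> 16) & 0xff);   /* CC */
    out[7] = (uint8_t)((ifd_offset >> 24) & 0xff);   /* DD */
}
```

For an estimated length of 150000 the offset is 150008 = 0x000249F8, so the last four header bytes come out as F8 49 02 00.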
Step 3: the TIFF directory will always have the same structure; some values in it are offsets, and they depend trivially on the IFD position 0xDDCCBBAA. Quoting from the TIFF 6.0 specification (note that the byte order in the sample is reversed with respect to our header: Motorola "MM", not Intel "II"):
TIFF 6.0 Specification, Final, June 3, 1992, page 20
Putting it all together (along with a couple of less-important fields that are discussed
later), a sample bilevel image file might contain the following fields
A Sample Bilevel TIFF File
Offset  Description                  Value
(hex)                                (numeric values are expressed in hexadecimal notation)
Header:
0000    Byte Order                   4D4D
0002    42                           002A
0004    1st IFD offset               00000014
IFD:
0014    Number of Directory Entries  000C
0016    NewSubfileType               00FE 0004 00000001 00000000
0022    ImageWidth                   0100 0004 00000001 000007D0
002E    ImageLength                  0101 0004 00000001 00000BB8
003A    Compression                  0103 0003 00000001 8005 0000
0046    PhotometricInterpretation    0106 0003 00000001 0001 0000
0052    StripOffsets                 0111 0004 000000BC 000000B6 (*1)
005E    RowsPerStrip                 0116 0004 00000001 00000010
006A    StripByteCounts              0117 0003 000000BC 000003A6 (*2)
0076    XResolution                  011A 0005 00000001 00000696 (*3)
0082    YResolution                  011B 0005 00000001 0000069E (*4)
008E    Software                     0131 0002 0000000E 000006A6 (*5)
009A    DateTime                     0132 0002 00000014 000006B6 (*6)
00A6    Next IFD offset              00000000
Values longer than 4 bytes:
(*1)    StripOffsets                 Offset0 00000008
(*2)    StripByteCounts              Count0
(*3)    XResolution                  0000012C 00000001
(*4)    YResolution                  0000012C 00000001
(*5)    Software                     “PageMaker 4.0”
(*6)    DateTime                     “1988:02:18 13:59:59”
In the sample above, the IFD offset (our 0xDDCCBBAA) is 0x0014, and all the other offsets follow from it.
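Here is a hedged sketch of a little-endian directory for our single-strip case. Several things in it are my assumptions rather than part of the spec sample: Compression is set to 4 (CCITT Group 4), PhotometricInterpretation to 0 (WhiteIsZero), and the resolution, Software and DateTime tags are dropped, so every value fits inline and nothing depends on the IFD position. All function and macro names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define IFD_ENTRIES 8
#define IFD_SIZE (2 + 12 * IFD_ENTRIES + 4)   /* count + entries + next-IFD */

static void put16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

static void put32(uint8_t *p, uint32_t v)
{
    put16(p, (uint16_t)(v & 0xffff));
    put16(p + 2, (uint16_t)(v >> 16));
}

/* One 12-byte entry: tag, type (3 = SHORT, 4 = LONG), count 1, inline value.
   In little-endian order a SHORT value written as 32 bits lands left-justified. */
static uint8_t *put_entry(uint8_t *p, uint16_t tag, uint16_t type, uint32_t value)
{
    put16(p, tag);
    put16(p + 2, type);
    put32(p + 4, 1);
    put32(p + 8, value);
    return p + 12;
}

/* Write the directory into out (at least IFD_SIZE bytes); return bytes written.
   Entries are emitted in ascending tag order, as TIFF requires. */
size_t write_bilevel_ifd(uint8_t *out, uint32_t width, uint32_t height,
                         uint32_t strip_bytes)
{
    uint8_t *p = out;
    put16(p, IFD_ENTRIES);
    p += 2;
    p = put_entry(p, 0x0100, 4, width);        /* ImageWidth */
    p = put_entry(p, 0x0101, 4, height);       /* ImageLength */
    p = put_entry(p, 0x0102, 3, 1);            /* BitsPerSample = 1 */
    p = put_entry(p, 0x0103, 3, 4);            /* Compression = 4 (assumed G4) */
    p = put_entry(p, 0x0106, 3, 0);            /* Photometric = 0 (assumed) */
    p = put_entry(p, 0x0111, 4, 8);            /* StripOffsets: right after header */
    p = put_entry(p, 0x0116, 4, height);       /* RowsPerStrip: whole image */
    p = put_entry(p, 0x0117, 4, strip_bytes);  /* StripByteCounts: the estimate */
    put32(p, 0);                               /* Next IFD offset: none */
    p += 4;
    return (size_t)(p - out);
}
```

If the resolution tags are needed, their RATIONAL values no longer fit in four bytes and must be stored after the directory, at offsets computed from 0xDDCCBBAA as in the sample.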
I have run some tests on a single-strip TIFF G4 image that I generated with ImageMagick and converted to 1-strip CCITT format with tiffcp. The header there is slightly different (I don't see the Software and DateTime tags that the spec says should be there). Otherwise it checks out.
We now have a damaged TIFF image with one overlong strip, and it is in memory.
Using TIFFClientOpen, we can access it as if it were a disk image.
Attempting to read the first strip will now result in an error and the program aborting:
TIFFFillStrip: Read error on strip 0; got 143151 bytes, expected 762826.
By using TIFFSetErrorHandler and TIFFSetErrorHandlerExt we set ourselves up to intercept this error and parse it, thereby recovering the 143151 byte count instead of aborting.
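A sketch of such a handler follows. It has the same shape as a libtiff TIFFErrorHandler (module, printf-style format, va_list) but uses only the standard library, so it compiles without tiffio.h. The names capture_read_error, simulate_error and g_actual_strip_bytes are mine, and simulate_error is only a stand-in for libtiff raising the error. The exact wording of the message may differ between libtiff versions, so the parser only looks for the "got N bytes" part.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Number of bytes libtiff said it actually read; -1 until the error is seen. */
long g_actual_strip_bytes = -1;

/* Format the message, then pull the number out of "got N bytes". */
void capture_read_error(const char *module, const char *fmt, va_list ap)
{
    char msg[512];
    const char *got;
    long n;

    (void)module;                        /* e.g. "TIFFFillStrip"; not needed */
    vsnprintf(msg, sizeof msg, fmt, ap);
    got = strstr(msg, "got ");
    if (got != NULL && sscanf(got, "got %ld bytes", &n) == 1)
        g_actual_strip_bytes = n;
}

/* Hypothetical helper: calls the handler the way a variadic caller would. */
void simulate_error(const char *module, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    capture_read_error(module, fmt, ap);
    va_end(ap);
}
```

In a real program the handler would be registered with TIFFSetErrorHandler(capture_read_error) before the strip is read, and g_actual_strip_bytes inspected afterwards.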
We need to supply the callbacks to TIFFClientOpen, but they're all very easy:
TIFFReadWriteProc readproc(h, *ptr, n) // copy n bytes from FakeBuffer+pos into ptr, update pos to pos + n, ignore h.
TIFFReadWriteProc writeproc // Throw an error. We don't write
TIFFSeekProc seekproc // update pos appropriately
TIFFCloseProc closeproc // do nothing
TIFFSizeProc sizeproc // return total buffer size
TIFFMapFileProc mapproc // Set to NULL
TIFFUnmapFileProc unmapproc // Set to NULL
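The callbacks listed above might look roughly like this. To keep the sketch compilable without tiffio.h I use standard types: with libtiff 4.x the handle would be thandle_t, the byte counts tmsize_t and the offsets toff_t. Instead of a global FakeBuffer and pos, the state lives in a small MemFile struct passed as the handle; that struct and all function names are my own.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>    /* SEEK_SET, SEEK_CUR, SEEK_END */
#include <string.h>

/* In-memory "file": the virtual TIFF plus the current read position. */
typedef struct {
    const uint8_t *data;
    size_t size;
    size_t pos;
} MemFile;

/* readproc: copy up to n bytes from data+pos into ptr, advance pos. */
ptrdiff_t mem_read(void *h, void *ptr, ptrdiff_t n)
{
    MemFile *f = (MemFile *)h;
    size_t left = f->pos < f->size ? f->size - f->pos : 0;
    size_t want = n > 0 ? (size_t)n : 0;
    if (want > left)
        want = left;                 /* short read at end of buffer */
    memcpy(ptr, f->data + f->pos, want);
    f->pos += want;
    return (ptrdiff_t)want;
}

/* writeproc: we never write, so always report failure. */
ptrdiff_t mem_write(void *h, void *ptr, ptrdiff_t n)
{
    (void)h; (void)ptr; (void)n;
    return -1;
}

/* seekproc: offsets are unsigned, so a negative relative seek arrives
   wrapped around and the unsigned addition undoes the wrap. */
uint64_t mem_seek(void *h, uint64_t off, int whence)
{
    MemFile *f = (MemFile *)h;
    uint64_t base = 0;
    if (whence == SEEK_CUR)
        base = f->pos;
    else if (whence == SEEK_END)
        base = f->size;
    f->pos = (size_t)(base + off);
    return f->pos;
}

/* closeproc: nothing to release; the caller owns the buffer. */
int mem_close(void *h)
{
    (void)h;
    return 0;
}

/* sizeproc: total size of the virtual file. */
uint64_t mem_size(void *h)
{
    return ((MemFile *)h)->size;
}
```

After switching to the libtiff types, these would be passed as TIFFClientOpen(name, "r", (thandle_t)&file, mem_read, mem_write, mem_seek, mem_close, mem_size, NULL, NULL), with the two NULLs being mapproc and unmapproc as in the list above.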
The processing is indeed awkward and convoluted, but as for feasibility, it can be done.
I have run tests in C, extracting by hand the CCITT stream from a PDF with an inline image (BI/ID/EI) that I found online, and reading it as described above.
If I had a sure-fire way of identifying the correct EI, I could always determine the correct offset and directly produce a correct, readable TIFF file from the PDF without involving libtiff at all. I've dredged up a message by Tilman Hausherr explaining a hack that identifies the EI by recognizing valid PDF operators following it, which makes me think there probably aren't many better methods.