-1

I am reading text from PDF documents using the iText library. However, some PDF documents might have an image embedded within them in addition to text.

I'm wondering whether there is any way, through iText or something else, to determine whether a PDF document contains an image?

Cœur
Anthony
  • look here http://stackoverflow.com/questions/7007917/how-to-extract-images-from-a-pdf-with-itext-in-the-correct-order Use the same basic sets to see if one exists. – Phil Jun 20 '13 at 21:05
  • If you don't want to switch to PDFBox as suggested by @Phil's reference... You can use the iText classes from the parser package for bitmap image extraction, too. – mkl Jun 20 '13 at 22:10
  • I came across this link, however, I need to find out whether an image even exists in the pdf. http://itextpdf.com/examples/iia.php?id=284 – Anthony Jun 20 '13 at 22:17
  • In that case simply create your own image render listener. If it is only to check for the existence of an image, it'll be much simpler than the one used in that sample. – mkl Jun 20 '13 at 22:35

2 Answers

3

You can do a correct and 100% reliable check using a PDF library.

However you can probably do a fairly reliable check just by reading the PDF as text and processing it that way. You need to first check it is a PDF by looking for the PDF header at the start,

%PDF...

Then scan through looking for the phrase,

/XObject

When you hit this tag you need to check backwards and forwards in the stream to the << and >> dictionary boundaries to pull out the full XObject dictionary. There may be nested << and >> so you might want to check back to the 'obj' and forwards to the 'stream' entry. Anyhow you'll end up with something that looks like this,

<< 
/Type /XObject /Subtype /Image /Name /I1 
/Width 800 /Height 128 
/BitsPerComponent 1 /ImageMask true 
/Filter [/FlateDecode] 
/Length 2302 >> 

The thing you need to check here is that there is this /Subtype entry and an /Image separated by some whitespace. If you hit that then you have an image.
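The text-scanning check described above can be sketched in a few lines. This is only a rough illustration, not a robust implementation: the function name `probably_contains_image` is made up for the example, it skips the step of isolating the full dictionary and just looks for the /Subtype /Image pair, and it accepts zero whitespace between the two names (`/Subtype/Image` is also legal PDF syntax). It can only see dictionaries that are stored uncompressed in the file.

```python
import re

# "/Subtype" followed by the name "/Image". The lookahead makes sure the name
# really ends there, so a longer name such as "/ImageX" does not match.
_IMAGE_SUBTYPE = re.compile(rb"/Subtype\s*/Image(?![^\s()<>\[\]{}/%])")


def probably_contains_image(data: bytes) -> bool:
    """Heuristic check on raw PDF bytes; may give false positives/negatives."""
    # First make sure it is a PDF at all by looking for the header at the start.
    if not data.startswith(b"%PDF-"):
        return False
    # Then look for an image XObject dictionary entry anywhere in the file.
    return _IMAGE_SUBTYPE.search(data) is not None
```

You would call it with the whole file read in binary mode, e.g. `probably_contains_image(open(path, "rb").read())`, where `path` is whatever file you are checking.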

So what are the limits of this approach?

Well it is possible to embed an image in the document but not use it. That would result in a false positive. I think this is pretty unlikely though. It would be very inefficient to do so and only a really skanky producer would do it.

Images can also be embedded inline in page content streams, as Hugo mentions in his answer. That would result in a false negative. These are pretty uncommon though. It's one of those bits of the spec which was never a good idea and it's not widely used. If you have documents from a single producer (as is often the case) it will become apparent very quickly whether it does this or not. At a guess, I can't imagine that more than 1% of wild PDFs would contain this construct.
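If you do want to cover the inline case, something along these lines could be run over a content stream once it has been decoded (getting hold of the decoded stream is left out here and is assumed to be done elsewhere, e.g. by a PDF library). The name `has_inline_image` is invented for this sketch. It simply looks for the `BI` operator followed later by `ID`, which is how an inline image starts, and it does no real tokenizing, so text strings that happen to contain those words could fool it.

```python
import re

# An inline image looks like:  BI <key/value pairs> ID <image data> EI
# Match "BI" as a standalone token, then the first standalone "ID" after it.
_INLINE_IMAGE_START = re.compile(rb"(?:^|\s)BI\s.*?\sID\s", re.DOTALL)


def has_inline_image(content: bytes) -> bool:
    """Rough check for an inline image in an already decoded content stream."""
    return _INLINE_IMAGE_START.search(content) is not None
```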

It is possible to embed these XObject tags as references rather than direct objects. But I think you can completely discount that. While legal it would be absolutely bizarre. I don't think you'll ever see that.

The correct way involves scanning and parsing all the content streams in the PDF. It's what we do in ABCpdf (which I work on) but it is a lot more work and a lot more processing power. It could be many seconds on a large document.

Think about whether 99% reliability is going to be good enough. :-)

0

Images in a PDF are either image XObjects or inline images embedded in the content with the BI ... EI operators. So you have to parse the Resources dictionary of the page and recursively examine its XObjects, because a form XObject has its own Resources dictionary and may contain an image too. You will also have to parse all content streams and check whether an inline image is present. Additionally, images may be defined in Patterns. That is the way to go if you are going to implement your own image presence checker. Read the spec first and estimate the time it will cost you. A third-party library might not be that expensive in the end.

Hugo Moreno
  • Can iText not do what you are suggesting? – Anthony Jun 21 '13 at 05:10
  • Bugs exist everywhere, in iText and in other tools. PDF files are no exception: there are lots of malformed, incorrectly created examples. That might be your case. It would be helpful if you posted the sample code you tried to accomplish this task with. – Hugo Moreno Jun 21 '13 at 10:04