I'm trying to use iTextSharp to inspect PDFs and check them for irregularities before they are printed. Part of this is checking the images in each PDF for DPI, transparency, and similar properties.

To do this, I loop through the pages and retrieve PdfObjects, which I cast to PRStream. From each PRStream I retrieve PdfName.SUBTYPE and check whether it matches PdfName.IMAGE.

Checking that the objects found are actually images seems logical, but I run into cases where the Subtype is empty: what appears to be an image in a PDF (I have tested several PDFs of my own as well as PDFs found online) is not recognised as an image and is therefore ignored.

Am I using the library incorrectly?

Code snippet:

PdfObject pdfObject = pdfReader.GetPdfObject(i); // get the object at index i in the objects collection
if (pdfObject == null || !pdfObject.IsStream()) // skip missing objects and non-stream objects
{
    continue;
}
PRStream prStream = (PRStream) pdfObject; // cast the object to a stream
PdfObject type = prStream.Get(PdfName.SUBTYPE); // get the stream's subtype
// check whether the stream is an image
if (type != null && type.ToString().Equals(PdfName.IMAGE.ToString()))
// this condition is false where I expect it to be true

EDIT: As requested, here is a PDF that I have used. In this case, there are several images on pages 2, 4, 5, 6 and 8. However, the code that I run only recognises a single image, on page 5. Objects are found on pages 4 and 8, but the SUBTYPE of these objects is null.

CodeGhost
  • At first glance, no. It just seems like you're running into the wonderful world of PDF syntax and the untold numbers of weird PDFs floating around. – Samuel Huylebroeck Oct 23 '17 at 13:07
  • Is there any way to catch more images than iTextSharp does now? With a different library perhaps? – CodeGhost Oct 23 '17 at 13:09
  • There are multiple ways of achieving this. If you can share your PDF, I can test an alternative route. – Joris Schellekens Oct 23 '17 at 13:35
  • With regard to image formats, all PDF libraries I know of are in the same league. If you share your PDF, people will have a better idea of how to improve your code. – Amedee Van Gasse Oct 23 '17 at 13:35
  • If you only look at the stream object, you might get the number of pixels of the image, but you will never get the resolution. I'd like to appeal to your common sense to help you understand why not. **A single image stream** can be used as an external object that **appears on many different pages**. The same image can be rendered on different pages (or even on the same page) **in different sizes. This means that the resolution of the image is not stored** (and cannot be stored) **in the image stream.** To calculate the resolution (the dpi), you need more than just the stream. – Bruno Lowagie Oct 23 '17 at 13:41
  • I have added a PDF that you can test with. @Bruno Lowagie: I understand. Nonetheless, I need to be able to retrieve the images. As I mentioned, there are several checks, and not being able to check the DPI does not change my problem. – CodeGhost Oct 23 '17 at 14:17
  • I ran your code against all objects in the PDF and had 5 positives (objects 4, 11, 15, 19, 26, 27). How come you get only one positive? In other words, your issue cannot be reproduced... – mkl Oct 23 '17 at 14:59
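
To make Bruno Lowagie's point about resolution concrete, here is a minimal sketch of the arithmetic in plain C#. It assumes the default PDF user space of 72 units per inch. The class name `ImageResolution` and method name `EffectiveDpi` are hypothetical, chosen only for illustration. The pixel count would come from the image stream, while the rendered size has to come from wherever the image is actually drawn on a page.

```csharp
using System;

public static class ImageResolution
{
    // Assumption: default PDF user space, where 72 units equal one inch.
    private const double PointsPerInch = 72.0;

    // pixels:         image width (or height) in pixels, as stored in the image stream
    // renderedPoints: width (or height) at which this particular use of the image
    //                 is drawn on the page, in points
    public static double EffectiveDpi(int pixels, double renderedPoints)
    {
        if (renderedPoints <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(renderedPoints));
        }
        return pixels / (renderedPoints / PointsPerInch);
    }
}
```

For example, a 600-pixel-wide image drawn 144 points (2 inches) wide gives `EffectiveDpi(600, 144)` = 300, while the same stream drawn 288 points wide gives 150. The same image stream therefore has a different effective DPI for each placement.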

1 Answer

It appears I was using PdfObject incorrectly (my apologies). I managed to resolve the issue by using some code from this other question. I suspected this was the case because @mkl managed to run my snippet successfully, which raised the question of where the error was, if not in the given snippet. I had copied code from someone who was trying to extract images, but who apparently either posted incorrect code or had different intentions than I did.

Thank you everyone for helping! As this is my first Stack Overflow question, I don't know how to close it and/or award a comment for helping me on my way, sorry.
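
The code from the other question is not reproduced here, so the following is not necessarily what I ended up using. As a minimal sketch, assuming iTextSharp 5.x, one common way to find images per page is to walk each page's resources and look at its XObject entries, descending into form XObjects. The class `ImageXObjectFinder`, its method names, and the depth cap of 8 are hypothetical choices for illustration.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

public static class ImageXObjectFinder
{
    // Collects image XObjects reachable from a resources dictionary.
    // The depth cap is an arbitrary guard against form XObjects that refer to each other.
    public static void Collect(PdfDictionary resources, List<PdfStream> found, int depth = 0)
    {
        if (resources == null || depth > 8) return;

        PdfDictionary xobjects = resources.GetAsDict(PdfName.XOBJECT);
        if (xobjects == null) return;

        foreach (PdfName name in xobjects.Keys)
        {
            // GetDirectObject resolves indirect references to the actual object.
            PdfObject obj = xobjects.GetDirectObject(name);
            if (obj == null || !obj.IsStream()) continue;

            PdfStream stream = (PdfStream)obj;
            PdfName subtype = stream.GetAsName(PdfName.SUBTYPE);

            if (PdfName.IMAGE.Equals(subtype))
            {
                found.Add(stream);
            }
            else if (PdfName.FORM.Equals(subtype))
            {
                // Form XObjects carry their own resources, which may hold more images.
                Collect(stream.GetAsDict(PdfName.RESOURCES), found, depth + 1);
            }
        }
    }

    // Page numbers are 1-based in PdfReader.
    public static List<PdfStream> FindImagesOnPage(PdfReader reader, int pageNumber)
    {
        var found = new List<PdfStream>();
        PdfDictionary page = reader.GetPageN(pageNumber);
        Collect(page.GetAsDict(PdfName.RESOURCES), found);
        return found;
    }
}
```

Calling `FindImagesOnPage(reader, n)` for each `n` from 1 to `reader.NumberOfPages` gives the image streams referenced by each page. Inline images, which live inside the page content stream rather than in the resources, are not found this way.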

CodeGhost
  • You can only properly "award" an answer, not a comment (a comment can be upvoted, but that is not reflected anywhere except on the comment itself). You can "close" a question by accepting an answer (clicking the tick at the upper left of the answer). – mkl Oct 24 '17 at 04:34