I am trying to write a small console application that can tell the difference between a blank page and a page that contains only a single image and no text. To detect blank pages I'm using iTextSharp's SimpleTextExtractionStrategy, and that works well. The problem is that pages containing only an image are also reported as blank by this method. Is there a way to detect whether a PDF page contains an image, similar to the way I am checking for text on the pages?
-
The iTextSharp parser framework reports bitmap images, too. `SimpleTextExtractionStrategy` ignores these callbacks, but you can easily derive a new listener class that also checks whether any images were encountered. – mkl Jul 21 '15 at 19:21
-
Thank you for getting back to me. I started looking at posts about how people retrieve images from PDF files, and I learned I can use PdfDictionary and then .getPdfObject to find out whether the page contains an image or not. – Mitch Jul 21 '15 at 19:24
-
That sounds like posts telling people to merely look at the image resources of the page. This is not a good solution, though: image resources are not necessarily used on a page, and inline images are not found that way. – mkl Jul 21 '15 at 20:27
-
You could essentially walk the contents of a PDF using the text extraction strategies, handling every possible operation that could draw something, but I think it would be far easier to actually render the page and just look for non-white pixels. I'd rasterize the [PDF to a bitmap format](http://stackoverflow.com/questions/14995170/pdf-to-png-with-high-resolution) and then look for [non-white pixels](http://stackoverflow.com/a/10334320/231316). – Chris Haas Jul 21 '15 at 23:10
-
For what purpose are you trying to find out whether a page is blank? Is it about the page *looking* empty, i.e. being completely white? Or is it about the page *being* empty? In the former case there could still be text rendered invisibly or in white, or images that are completely white or transparent; in the latter case there could not. – mkl Jul 22 '15 at 07:42
-
I know that on some blank pages I have encountered one or two lines of text that were written by the program creating them. Basically, this is for a tool that will go through a large batch of flyers and tell the people printing them how many pages in the PDF will be blank and how many will have only the backer image on them. I appreciate you guys taking the time to answer back! – Mitch Jul 22 '15 at 12:36
-
As it is for printing purposes you should really consider @Chris' approach of rendering the page as image and checking the image for non-whites. – mkl Jul 22 '15 at 13:33
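
For reference, the pixel check suggested in the comments above could look roughly like the following. This is only a sketch, written in Java with the standard library rather than C#, and it assumes the page has already been rasterized onto an opaque white background by some external renderer (the rendering step itself is not shown). The class name `BlankPageCheck`, the method `isBlank`, and the `threshold` parameter are made up for illustration and are not part of iTextSharp or any other library.

```java
import java.awt.image.BufferedImage;

/** Sketch: decide whether an already-rendered page image is visually blank. */
final class BlankPageCheck {

    /**
     * Returns true if every pixel is at least as bright as the threshold in
     * all three colour channels. The threshold (0-255) is a hypothetical
     * tolerance for near-white rendering or compression noise; 255 demands
     * pure white.
     *
     * Assumption: the page was rendered onto an opaque white background.
     * Fully transparent pixels would be read as black and count as content.
     */
    static boolean isBlank(BufferedImage page, int threshold) {
        for (int y = 0; y < page.getHeight(); y++) {
            for (int x = 0; x < page.getWidth(); x++) {
                int rgb = page.getRGB(x, y); // packed as 0xAARRGGBB
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                if (r < threshold || g < threshold || b < threshold) {
                    return false; // found a visibly non-white pixel
                }
            }
        }
        return true;
    }
}
```

Calling `BlankPageCheck.isBlank(pageImage, 250)` on each rendered page would then count the pages that print as blank. A page that is not blank by this check but yields no text from the text extraction strategy would be a candidate for "backer image only", though vector graphics without any bitmap would fall into the same bucket.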