8

I cannot reliably detect blank pages in a PDF file. I have searched the internet but could not find a good solution.

Using iTextSharp, I tried checking the page content size and the XObjects, but neither gives an exact result.

I tried this logic (pseudocode):

if (xobjects == null || textContent == null || contentSize < 20 bytes)
    page is blank
else
    page is not blank

But most of the time it returns the wrong answer.

The code is below. I am using the iTextSharp library.

For xobjects

PdfDictionary xobjects = resourceDic.GetAsDict(PdfName.XOBJECT);
// Here resourceDic is of type PdfDictionary.
// I assumed that if xobjects is null then the page is blank,
// but sometimes a blank page has an xobjects dictionary that is not null.

For contentstream

// Here reader = new PdfReader(filename);
RandomAccessFileOrArray f = reader.SafeFile;

byte[] contentBytes = reader.GetPageContent(pageNum, f);
// I measured the length of contentBytes, but sometimes it is more than 20 bytes for a blank page.

For textcontent

String extractedText = PdfTextExtractor.GetTextFromPage(reader, pageNum, new LocationTextExtractionStrategy());
// Sometimes a blank page yields text that is more than 20 characters long.
Md Kamruzzaman Sarker
  • What _do_ you get on a page you know to be blank? (Edit this detail into your answer, rather than appending substantial detail in the comments). – halfer Jun 10 '12 at 09:41
  • That's a good question now. I don't know the answer, since I've not done any PDF parsing before. Have you analysed those three categories of object to see if empty pages have something in common? For example, what text content actually appears on a blank page? – halfer Jun 10 '12 at 13:21
  • The text in blank pages differs from PDF to PDF. I have not found any similarities so far. – Md Kamruzzaman Sarker Jun 10 '12 at 13:25
  • Err, can you provide an example, or do I have to guess? – halfer Jun 10 '12 at 13:25
  • I found that the text in one blank page is: 01 557599 FM.qxd 4/29/04 11.32AM Page ii – Md Kamruzzaman Sarker Jun 10 '12 at 13:36
  • PDF pages which *seem* to be blank can always contain objects which are not visible (or not printed). The most simple one is the infamous 'white text on white background' example. (You could still highlight the text and copy it though...) A more complex one is a page consisting of different layers, where you by default set to visible (and printing) only an empty layer. -- So... **the only way to reliably discover if a page is *empty* is to 'print' it or 'print it virtually'**. This is what my suggested solution (using Ghostscript) does. – Kurt Pfeifle Jul 18 '12 at 08:11

3 Answers

2

A very simple way to discover empty pages is this: use a Ghostscript commandline that calls the bbox device.

Ghostscript's bbox device calculates the coordinates of the minimum rectangle (the 'bounding box') that encloses all points of the page where a pixel would be rendered:

gs \
  -o /dev/null \
  -sDEVICE=bbox \
   input.pdf

On Windows:

gswin32c.exe ^
  -o nul ^
  -sDEVICE=bbox ^
   input.pdf

Result:

GPL Ghostscript 9.05 (2012-02-08)
Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 6.
Page 1
%%BoundingBox: 27 281 548 804
%%HiResBoundingBox: 27.000000 281.000000 547.332031 804.000000
Page 2
%%BoundingBox: 0 0 0 0
%%HiResBoundingBox: 0.000000 0.000000 0.000000 0.000000
Page 3
%%BoundingBox: 27 302 568 814
%%HiResBoundingBox: 27.949219 302.000000 567.332031 814.000000
Page 4
%%BoundingBox: 27 302 568 814
%%HiResBoundingBox: 27.949219 302.000000 567.332031 814.000000
Page 5
%%BoundingBox: 27 302 568 814
%%HiResBoundingBox: 27.949219 302.000000 567.332031 814.000000
Page 6
%%BoundingBox: 27 302 568 814
%%HiResBoundingBox: 27.949219 302.000000 567.332031 814.000000

As you can see, page 2 of my input document was empty: its bounding box is `0 0 0 0`, so nothing would be rendered on that page.
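The all-zero bounding box check can be scripted. The following is a minimal sketch, not a definitive implementation: `blank_pages` is a hypothetical helper name, and it assumes Ghostscript processes the whole document starting at page 1 and prints exactly one `%%BoundingBox:` line per page, in page order. It counts those lines instead of relying on the `Page N` lines, because the two kinds of lines may arrive on different output streams.

```shell
#!/bin/sh
# blank_pages (hypothetical helper): read Ghostscript bbox output on stdin
# and print the number of every page whose bounding box is "0 0 0 0".
# Assumption: one %%BoundingBox line per page, in order, starting at page 1.
blank_pages() {
  awk '
    /^%%BoundingBox:/ {
      page++
      # "+ 0" forces a numeric comparison, which also tolerates a trailing
      # carriage return on the last field in Windows output.
      if ($2 + 0 == 0 && $3 + 0 == 0 && $4 + 0 == 0 && $5 + 0 == 0)
        print page
    }
  '
}
```

A possible invocation is `gs -o /dev/null -sDEVICE=bbox input.pdf 2>&1 | blank_pages`, where `2>&1` merges both output streams into the pipe. For the sample output above this would print `2`.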

Kurt Pfeifle
  • I am new to PDF. The command works when I run it in the command window, but when I run it from a batch file, the output file contains only the page numbers and nothing else. How can I overcome this problem? @Kurt Pfeifle – Mohammad Arifuzzaman Jul 03 '12 at 14:11
  • @lazy king: The reason could be this: (1) all lines starting with `Page ` are directed to `stdout`. (2) all lines starting with `%%BoundingBox: ` or `%%HiResBoundingBox: ` are directed to `stderr`. In the command window the stderr output could possibly be suppressed if you run the command from inside a batch file... – Kurt Pfeifle Jul 03 '12 at 14:33
  • @lazy king: You could try this `gswin32c.exe -sstdout=%stderr -o nul -sDEVICE=bbox "input.pdf" 2>output.txt` to re-direct all batch file output to a text file named *output.txt*. Is that what you want? – Kurt Pfeifle Jul 03 '12 at 15:05
1

I suspect you have tried .Trim() on your strings, so I won't suggest that on its own.

What are the actual contents of the 20+ character strings on the blank pages? I suspect they are just newline characters (as happens when people press Enter 10+ times to get to a new page instead of inserting a page break), in which case:

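// Replace is an instance method on string, so the calls are chained.
String extractedText = PdfTextExtractor
    .GetTextFromPage(reader, pageNum, new LocationTextExtractionStrategy())
    .Replace(Environment.NewLine, "")  // strip platform line breaks
    .Replace("\n", "")                 // strip any remaining bare line feeds
    .Trim();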

Let us know what the output is after this.

Another possibility is that the text consists of non-breaking spaces and other characters that look like spaces but are not, which you would have to find and replace individually. In that case I would instead suggest a regex match for [0-9a-zA-Z] (no commas inside the brackets, or a literal comma would also match) and treat a page with no match as blank.
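The alphanumeric test can be sketched outside of C# as well. The following is a minimal shell sketch, assuming the extracted page text has been saved or piped as plain text; `is_blank_text` is a hypothetical helper name. It only covers ASCII letters and digits, and it would not classify a page as blank if hidden text such as `01 557599 FM.qxd 4/29/04 11.32AM Page ii` is present.

```shell
#!/bin/sh
# is_blank_text (hypothetical helper): succeed (exit status 0) when the text
# on stdin contains no ASCII letter or digit, i.e. only whitespace,
# punctuation, or non-ASCII bytes such as non-breaking spaces.
is_blank_text() {
  # LC_ALL=C makes the bracket ranges match plain ASCII only.
  ! LC_ALL=C grep -q '[0-9A-Za-z]'
}
```

For example, `is_blank_text < page_text.txt && echo "blank"` would report a page whose extracted text holds only line breaks and spaces; the file name is illustrative.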

Seph
  • I do not trim the string. I just get the string and show it. I can give you a PDF file that has a blank page whose text is: 01 557599 FM.qxd 4/29/04 11.32AM Page ii – Md Kamruzzaman Sarker Jun 10 '12 at 13:34
  • That's likely from hidden fields in the header / footer.. if you can remove header and footers from the document before getting the text from the page that might be a good option. – Seph Jun 10 '12 at 13:37
  • Otherwise you might need to render the pages to images and compare if they're blank or not (unless the pages have background images or watermarks or similar).. I look forward to seeing if someone else has a more solid suggestion than that though (if you can't remove the header and footer sections of each page). – Seph Jun 10 '12 at 13:38
  • Removing the header or footer is not possible. I can only add a header or footer. – Md Kamruzzaman Sarker Jun 10 '12 at 13:45
  • After rendering to an image, how can I check that it is blank? I want to do it programmatically, not by looking at it. I am looking for a reliable solution. – Md Kamruzzaman Sarker Jun 10 '12 at 13:47
  • can you read only the non-header/footer content at all? try using http://www.aspose.com/categories/.net-components/aspose.pdf-for-.net/default.aspx it comes in evaluation version – Seph Jun 10 '12 at 13:50
-1

There is a wrapper library for C# and VB.NET around the native MuPDF library. You could use it to render pages to bitmaps (in different formats: TIFF, JPEG, PNG) and check the size of each resulting bitmap.

You would need to determine a size threshold below which you consider a page blank, for example by measuring a page that contains the fewest characters you still want to count as non-blank.

Simon Adcock