How to count color pages in a PDF/Word doc using Java

Question

I am looking to develop a desktop application using Java to count the number of colored pages in a PDF or Word file. This will be used as part of an overall system to help calculate the cost of printing a document in terms of how many pages there are (color/B&W).

Ideally, the user of the application would use a file dialog to select the desired PRF/Word file, the application could then count and output the number of colored pages, allowing the system to automatically calculate document cost accordingly.

i.e if A4 colored pages cost 50c per page to print, and B&W cost 10c per page, calculate the total cost of the document per colored/B&W pages.

I am aware of the existing software Rapid PDF Count http://www.traction-software.co.uk/rapidpdfcount/, but would be unsuitable as part on integration into a new system. I have also tried using GhostScript/Python as per this solution: http://root42.blogspot.de/2012/10/counting-color-pages-in-pdf-files.html, however this takes too long (5mins to count a 100 page pdf), and would be difficult to implement into a desktop app.

Is there any method of counting the number of colored pages in a PDF or Word file using Java (or alternative language)

Thanks

score 1 · Answer 1 · edited May 23 '17 at 11:58

Although it might sound easy, the task is rather complicated.

One option would be to use a program such as iText to walk every single token in the PDF, look for tokens that support color and compare that to your definition of "black". However, this will only get you basic text and drawing commands. Images are a completely different beast so you'll probably need to find an image parser or grab a copy of each spec and then walk each of those.

One of the downsides of token walking is you need to properly handle tokens that reference other things and further walk those tokens.

Another downside is that things can overlap each other so you'd probably want be aware of their coordinates, z-index, transparency and such.

There will be many more bumps in the road but that's a good start. What's most interesting is that if you accomplish this, you'll actually have found that you've partially built a PDF renderer!

Next, you'll need to define "black". Off the top of my head there's RGB black, CMYK black, Grey black and maybe Lab black along with some Pantones. That shouldn't be too hard but if I were to build this I'd want to know "blank ink usage" which could also be shades of grey. There's also "rich blank" that you might need to deal with, too!

So, all that said, I think that the GhostScript option you found is really the best bet. It literally renders the PDF and calculates the ink coverage from an RGB standpoint. You still should handle grey's, too, but that shouldn't be too hard, here's a good starting point.

Very thorough answer, thanks! Will investigate the iText solution specified, but it looks like some form of Ghostscript implementation is the way forward — m24murray, Jun 05 '15 at 12:10

score 0 · Answer 2 · answered Jun 04 '15 at 20:31

Wanting to know what the click-charge is going to be is a pretty common problem, but it's not easy to solve at all. As already indicated by the answer Chris Haas gave, but I want to put another spin on it.

First of all, you have to wonder whether you really want to support both Word and PDF documents. Analysing Word files is less useful than you might think because that Word file is probably going to be converted into something else before it's going to be printed. And because of the fact that you're starting from Word, the chance that your nice RGB black text in Word gets converted to less-than-perfect 4 color black in PDF is very high. In other words, even though you might count a page of black text in Word as a 'cheap' page, it might turn into an expensive color page after conversion from Word to something that can be printed.

Let's consider the PDF case then. PDF supports a whole host of color spaces (gray, RGB, CMYK, the same with an ICC Profile attached, spot color and a few multi-spot color variants, CalGray and CalRGB and Lab. Besides that there is a whole range of very tricky features such as transparency, overprint, shades, images, masks... that you all have to take into account. The only truly good way to calculate what you need is to do essentially the same work as your printer will do; convert the PDF into one image per page and examine the pixels.

Because of what you want to do, the best way to progress would be to:
1) Convert any word files into PDF
2) Convert any PDF files into CMYK
3) Render each page of that CMYK file into an image.

Once you've done that you can examine the image and see whether you have any colors left. There are a number of potential technologies you can use for this. GhostScript is definitely one, but there are commercial solutions too that would certainly be more expensive but potentially faster.

Hi thanks for your answer, i'm not intending on specifically supporting both PDF and Word, it was just a suggestion in case there was a Word-specific solution. Information on calculating color will be very useful. Thanks — m24murray, Jun 05 '15 at 12:08

How to count color pages in a PDF/Word doc using Java

2 Answers2