How to extract images from pdf using Java (not using pdfbox)

Question

I've been researching how to extract images from a large (> 300 MB) PDF file. I'm using pdfbox, but for some reason I can't figure out, some pages are not extracted correctly.

I'm using the PDFToImage class of pdfbox as base for my code.

So, do you know of another library that could help me do this? I know that iText could be used, but I read that it can't be used in commercial products.

I've installed the xpdf and xpdf-utils packages, and the pdfimages utility works perfectly. But I need to solve this problem from Java, and the solution should be portable.
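For reference, one way to drive pdfimages from Java is to launch it as an external process. The sketch below is only an illustration under stated assumptions: it assumes the pdfimages binary is on the PATH, the class and method names (PdfImagesRunner, buildCommand, run) are hypothetical, and it is not portable to machines where the utility is not installed. It uses the -f and -l options to limit the page range and -j to write JPEG-encoded images as JPEG files.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: wraps the external pdfimages utility (assumed to be on the PATH).
class PdfImagesRunner {

    // Builds the argument list for:
    //   pdfimages -f <firstPage> -l <lastPage> -j <pdf> <outputPrefix>
    static List<String> buildCommand(File pdf, String outputPrefix, int firstPage, int lastPage) {
        if (firstPage < 1 || lastPage < firstPage) {
            throw new IllegalArgumentException("invalid page range");
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("pdfimages");
        cmd.add("-f");
        cmd.add(Integer.toString(firstPage));
        cmd.add("-l");
        cmd.add(Integer.toString(lastPage));
        cmd.add("-j"); // write DCT (JPEG) images as .jpg files
        cmd.add(pdf.getPath());
        cmd.add(outputPrefix);
        return cmd;
    }

    // Runs the command, forwards its output to this process, and returns the exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java PdfImagesRunner <file.pdf> <outputPrefix>
        List<String> cmd = buildCommand(new File(args[0]), args[1], 1, 10);
        int exitCode = run(cmd);
        if (exitCode != 0) {
            System.err.println("pdfimages exited with code " + exitCode);
        }
    }
}
```

Processing a limited page range per call keeps each run small, which may help with a file of this size, but the approach still depends on a native tool being present on every target machine.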

iText is under GPL unless you purchase a commercial license. – Thorbjørn Ravn Andersen Nov 30 '10 at 16:16
I will try versions < 5; I think the licensing terms changed for versions >= 5. – Claudio Acciaresi Nov 30 '10 at 16:56
What is wrong with the images that aren't correctly extracted? – Mark Storer Nov 30 '10 at 17:02
On two particular pages, each composed of several embedded images, the output is wrong. It's hard to describe: the embedded images are letters, and the rendered page has what look like holes between the letters. – Claudio Acciaresi Nov 30 '10 at 17:39

score 6 · Accepted Answer · edited Jan 17 '20 at 08:23

I think you're talking about two different things here: extracting images from a PDF, and converting PDF pages to images. PDFToImage renders one image for every page, while pdfimages extracts only the embedded images (so a text-only document yields 0 images).

Take a look at org.apache.pdfbox.tools.ExtractImages (source code) to see if it does what you want.

edited Jan 17 '20 at 08:23

Roland

answered Nov 30 '10 at 16:23

erjiang

Yes, you are right, I'm trying to convert a PDF page to an image, not to extract all the embedded images. The thing is that the PDF I'm using in this particular case has one image per page. Sorry for the misunderstanding. I also checked ExtractImages, with no luck. – Claudio Acciaresi Nov 30 '10 at 16:42
In the end I used pdfbox. The catch is that pdfbox does not correctly render PDFs that use fonts it doesn't recognize or a CMYK color space. For PDFs without these problems, the library works fine. – Claudio Acciaresi Jan 07 '11 at 14:41

score 0 · Answer 2 · answered Nov 30 '10 at 16:17

The most likely reason a 300 MB PDF is hard to work with is that you run out of memory. If your code works well for smaller PDFs, I would take a closer look at why it fails on the large one.
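A quick way to test the memory hypothesis is to compare the file size with the maximum heap the JVM is allowed to use. The sketch below is only a rough heuristic under stated assumptions: the class name HeapCheck, the method likelyTooLarge, and the 0.5 threshold are all made up for illustration, and a parsed PDF can need considerably more (or less) memory than its size on disk.

```java
import java.io.File;

// Hypothetical helper: rough check of PDF size against the JVM's maximum heap.
class HeapCheck {

    // Returns true if the file is larger than the given fraction of the maximum heap.
    // The fraction is an arbitrary assumption, not a documented limit of any library.
    static boolean likelyTooLarge(long fileBytes, long maxHeapBytes, double fraction) {
        return fileBytes > (long) (maxHeapBytes * fraction);
    }

    public static void main(String[] args) {
        // Usage: java HeapCheck <file.pdf>
        File pdf = new File(args[0]);
        long maxHeap = Runtime.getRuntime().maxMemory(); // upper bound set by -Xmx or the JVM default
        System.out.printf("PDF size: %d MB, max heap: %d MB%n",
                pdf.length() / (1024 * 1024), maxHeap / (1024 * 1024));
        if (likelyTooLarge(pdf.length(), maxHeap, 0.5)) {
            System.out.println("The heap may be too small; consider raising it, e.g. java -Xmx2g ...");
        }
    }
}
```

If the check fires, rerunning with a larger heap (the -Xmx JVM option) and watching for an OutOfMemoryError is a cheap way to confirm or rule out memory as the cause.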

answered Nov 30 '10 at 16:17

Thorbjørn Ravn Andersen

score 0 · Answer 3 · edited Feb 23 '11 at 04:25

Have you tried ICEpdf or JPedal (both pure Java)?

edited Feb 23 '11 at 04:25

malaverdiere

answered Nov 30 '10 at 16:50

mark stephens

Nope, I have not. Can I use either of those in a commercial product? – Claudio Acciaresi Nov 30 '10 at 16:51
They both have LGPL and commercial versions. You can use either in a commercial product. – mark stephens Nov 30 '10 at 18:25
I've tested ICEpdf; the pages are extracted OK, but I'm having problems with fonts now :(. I'm using this as a guide: http://wiki.icefaces.org/display/PDF/Converting+PDF+Page+Renderings – Claudio Acciaresi Nov 30 '10 at 18:44
