PDFBox PDFImageWriter.writeImage is not handling all characters properly

Question

I am using PDFBox 1.8.10 to load PDFs and to overlay images on each page.

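import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFImageWriter;

// Render every page (1..N) to an image file named from filePrefix
PDDocument doc = PDDocument.load(url);
PDFImageWriter imageWriter = new PDFImageWriter();
imageWriter.writeImage(doc, imageFormat, password, 1,
        doc.getNumberOfPages(), filePrefix, imageType, resolution);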

I have tried saving the doc as a PDF, and that output looks fine. The saved images, however, can contain incorrect text. This is especially true for Eastern European documents, e.g. Hungarian, Polish, or Czech ones.

The PDF shows

H-4432 NYÍREGYHÁZA-NYÍRSZŐLŐS

The image shows

Is there a solution for this? Do I need to define a codepage? Could it be a problem with the available fonts?
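One way to start narrowing down the codepage question is to list which characters of an affected line fall outside ISO-8859-1. The following is only a diagnostic sketch: the class name `NonLatin1Finder` and the method `outsideLatin1` are made up for illustration, and the result does not by itself show where the rendering goes wrong.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

public class NonLatin1Finder {

    /**
     * Returns the distinct characters of text that ISO-8859-1 cannot encode,
     * in order of first appearance. (Hypothetical helper, not part of PDFBox.)
     */
    static Set<Character> outsideLatin1(String text) {
        CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
        Set<Character> result = new LinkedHashSet<>();
        for (char c : text.toCharArray()) {
            if (!latin1.canEncode(c)) {
                result.add(c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The address line from the question, written with Unicode escapes
        // so that the source file stays plain ASCII.
        String sample = "H-4432 NY\u00CDREGYH\u00C1ZA-NY\u00CDRSZ\u0150L\u0150S";
        for (char c : outsideLatin1(sample)) {
            // Print code points rather than glyphs to avoid console encoding issues.
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

For the sample line, Í and Á are part of ISO-8859-1, so only Ő (U+0150) is reported. If the broken glyphs in the image match the reported characters, that would point towards an encoding or font-mapping issue rather than a damaged PDF.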

See this: http://stackoverflow.com/questions/22260344/pdfbox-encode-symbol-currency-euro — Adam Michalik, Sep 01 '15 at 09:59
PDFBox's capabilities for rendering PDFs to images are quite limited in the 1.x versions. They have improved considerably in the 2.0.0-SNAPSHOT development versions, cf. [this answer](http://stackoverflow.com/a/24238070/1729265), [this answer](http://stackoverflow.com/a/22358240/1729265), and [this one](http://stackoverflow.com/a/21547909/1729265). Unfortunately the PDFBox 2.0.0-SNAPSHOT API is a moving target, massively refactored every other month, so the code in those answers may not work out of the box anymore. — mkl, Sep 01 '15 at 11:25

score 0 · Answer 1 · answered Sep 01 '15 at 13:56

The solution for me was to switch over to a 2.0 SNAPSHOT (Aug15). All the documents I've tested look fine. The API has changed but, in my case, it took 5 minutes to make the changes.
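For reference, a rough sketch of what the page-to-image loop can look like with the 2.0 API. This assumes the API as it appears in the PDFBox 2.0.x releases (`PDFRenderer`, `ImageType`); the Aug 2015 snapshot may have differed in details. The file name `input.pdf`, the prefix `page-`, and the 150 DPI value are placeholders, and the document is loaded from a file here rather than from a URL.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

public class RenderPagesSketch {
    public static void main(String[] args) throws IOException {
        File input = new File("input.pdf");   // placeholder input path
        String filePrefix = "page-";          // placeholder output prefix
        float dpi = 150f;                     // placeholder resolution

        // PDDocument is Closeable, so try-with-resources releases it.
        try (PDDocument doc = PDDocument.load(input)) {
            PDFRenderer renderer = new PDFRenderer(doc);
            // Page indices are zero-based in the 2.0 renderer.
            for (int i = 0; i < doc.getNumberOfPages(); i++) {
                BufferedImage image = renderer.renderImageWithDPI(i, dpi, ImageType.RGB);
                // Name the files with one-based page numbers.
                ImageIO.write(image, "png", new File(filePrefix + (i + 1) + ".png"));
            }
        }
    }
}
```

The main differences from the 1.8 code are that rendering goes through `PDFRenderer` instead of `PDFImageWriter`, and that writing the image file is a separate step.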

Thanks to @mkl for the info.

paul
