In Java, I would like to be able to read in a PDF file, test whether it is PDF/A (PDF for Archiving) compliant, and if not, then convert the file to PDF/A.

I might prefer this in Apache PDFBox because I've been doing a few things in that API already, but I'd be open to other APIs as well.

K J
user553702
    Which PDF/A flavors do you want to convert to? Some are difficult as explained by @Tilman's answer and others are even more difficult, especially if no human assistance shall be required... – mkl Aug 03 '16 at 14:39

2 Answers

Testing whether a PDF file is PDF/A-1b compliant can be done with PDFBox preflight; see the example here, or use the preflight-app.
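
As a cheap first look before running preflight, you can check whether a file even claims to be PDF/A. The following is a minimal sketch using only the JDK, not PDFBox: the class `PdfAClaimSniffer` and the method `declaredPdfA` are made-up names for illustration. It searches the raw bytes for the `pdfaid:part` and `pdfaid:conformance` XMP properties, assuming the XMP packet is stored uncompressed and in UTF-8. A claim is not proof of compliance, so this does not replace a real validation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reports what a file CLAIMS to be, not whether the claim holds.
class PdfAClaimSniffer {
    // Matches both XMP serializations: pdfaid:part="1" and <pdfaid:part>1</pdfaid:part>
    private static final Pattern PART =
            Pattern.compile("pdfaid:part\\s*(?:=\\s*[\"']|>)\\s*(\\d)");
    private static final Pattern CONFORMANCE =
            Pattern.compile("pdfaid:conformance\\s*(?:=\\s*[\"']|>)\\s*([A-Za-z])");

    static Optional<String> declaredPdfA(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to one char, so ASCII patterns survive binary data
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        Matcher part = PART.matcher(raw);
        if (!part.find()) {
            return Optional.empty(); // no PDF/A identification found in plain text
        }
        Matcher conformance = CONFORMANCE.matcher(raw);
        String level = conformance.find()
                ? conformance.group(1).toLowerCase(Locale.ROOT)
                : "";
        return Optional.of("PDF/A-" + part.group(1) + level);
    }
}
```

Calling `PdfAClaimSniffer.declaredPdfA(Files.readAllBytes(path))` returns an empty `Optional` when no identification is found; a non-empty result only tells you which flavor to validate against.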

Creating a tool to convert a file from PDF to PDF/A is a difficult task that would take months, possibly years. If you look at the source code of PDFBox preflight, you'll find hundreds of error messages, and your tool would have to be able to fix each of these errors. Some are:

  • non-embedded fonts
  • use of color without output intent
  • improper metadata
  • JBIG2-encoded images
  • LZW-encoded data

Just check a few of your own files with PDFBox preflight, and you'll see a wide variety of problems...
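
To get a rough feel for two of the items in the list above without a full validation run, here is a small JDK-only sketch. `FilterNameScan` and `countSuspectFilters` are hypothetical names, and the approach is a plain text search for the `/LZWDecode` and `/JBIG2Decode` filter names in the raw file. It will miss filters declared inside compressed object streams or written with escaped names, so treat a zero count as inconclusive and rely on preflight for the real answer.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: a crude text search, not a PDF parser.
class FilterNameScan {
    // Filter names corresponding to the LZW and JBIG2 items in the list above
    private static final String[] SUSPECT_FILTERS = {"/LZWDecode", "/JBIG2Decode"};

    static Map<String, Integer> countSuspectFilters(byte[] pdfBytes) {
        // ISO-8859-1 keeps a one-to-one byte-to-char mapping for binary content
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : SUSPECT_FILTERS) {
            int count = 0;
            // Count non-overlapping occurrences of the filter name
            for (int i = raw.indexOf(name); i >= 0; i = raw.indexOf(name, i + name.length())) {
                count++;
            }
            counts.put(name, count);
        }
        return counts;
    }
}
```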

If you don't have months or years, visit the homepages of Callas Software GmbH or PDF Tools AG to buy such a converter.

Tilman Hausherr
    I'd like to second Tilman. The issue is not only the technical steps but also having a good understanding of the PDF and PDF/A specifications. There were several discussions among vendors about how to read and interpret the spec, and the interpretations they agreed on are reflected in their tools. Although you can use PDFBox to build a converter, it might be more (cost-)effective to purchase an established one. Keep in mind that it's not always possible to convert an arbitrary PDF to PDF/A. – Maruan Sahyoun Aug 03 '16 at 12:46

I've been working on an easy way to convert PDF to PDF/A. In the end, I convert every page of the original PDF to an image and recreate the PDF using only those images.

This way I don't have to care about fonts, forms, or any other configuration.

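import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

// Written against the PDFBox 2.x API (PDDocument.load); the method lives in a class named PDFtoPDFA.
public void usingImages(File pdfFile) {
    try (PDDocument docIn = PDDocument.load(pdfFile);
         PDDocument docOut = new PDDocument()) {
        PDFRenderer pdfRenderer = new PDFRenderer(docIn);
        for (int pageIx = 0; pageIx < docIn.getNumberOfPages(); ++pageIx) {
            // Render the input page to a 300 DPI RGB bitmap and encode it as PNG
            BufferedImage bim = pdfRenderer.renderImageWithDPI(pageIx, 300, ImageType.RGB);
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            ImageIO.write(bim, "png", baos);
            byte[] pngBytes = baos.toByteArray();

            // Create the output page with the same visible size as the input page
            // (a bare new PDPage() would always be US Letter). The renderer applies the
            // page rotation, so swap the sides for pages rotated by 90 or 270 degrees.
            PDPage pageIn = docIn.getPage(pageIx);
            PDRectangle box = pageIn.getCropBox();
            boolean sideways = pageIn.getRotation() % 180 != 0;
            PDPage page = new PDPage(sideways
                    ? new PDRectangle(box.getHeight(), box.getWidth())
                    : new PDRectangle(box.getWidth(), box.getHeight()));
            docOut.addPage(page);

            // Wrap the PNG as an image XObject; the name is only a label and uses the page index
            PDImageXObject pdImage = PDImageXObject.createFromByteArray(docOut, pngBytes, "Pagina_" + pageIx);

            try (PDPageContentStream contentStream = new PDPageContentStream(docOut, page, PDPageContentStream.AppendMode.APPEND, true, true)) {
                // Scaling inspired by http://stackoverflow.com/a/22318681/535646:
                // pick the largest scale at which the image still fits inside the crop box
                float width = page.getCropBox().getWidth();
                float height = page.getCropBox().getHeight();
                float scale = width / pdImage.getWidth();
                if (scale > (height / pdImage.getHeight())) {
                    scale = height / pdImage.getHeight();
                }
                contentStream.drawImage(pdImage, page.getCropBox().getLowerLeftX(), page.getCropBox().getLowerLeftY(), pdImage.getWidth() * scale, pdImage.getHeight() * scale);
            }
        }
        // Note: flattening to images sidesteps font problems, but the saved file still has
        // no XMP PDF/A identification and no output intent, so validate it before relying on it.
        docOut.save(new File(pdfFile.getAbsolutePath() + ".PDFA.pdf"));
    } catch (Exception ex) {
        Logger.getLogger(PDFtoPDFA.class.getName()).log(Level.SEVERE, null, ex);
    }
}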
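
The 300 DPI setting drives both memory use and output size. Here is a back-of-the-envelope sketch, assuming the default PDF user unit of 1/72 inch and 3 bytes per pixel for raw 8-bit RGB samples. `RenderSizeEstimate` and its methods are hypothetical helpers, not part of PDFBox, and PDFBox's own rounding may differ by a pixel.

```java
// Hypothetical helper for estimating the size of a rendered page bitmap.
class RenderSizeEstimate {
    // Convert a length in PDF points (1/72 inch) to pixels at the given resolution
    static int pixels(float points, int dpi) {
        return Math.round(points * dpi / 72f);
    }

    // Uncompressed size of an 8-bit RGB bitmap for a page of the given size in points
    static long rawRgbBytes(float widthPt, float heightPt, int dpi) {
        return (long) pixels(widthPt, dpi) * pixels(heightPt, dpi) * 3L;
    }
}
```

For a US Letter page (612 × 792 pt) at 300 DPI this gives 2550 × 3300 pixels, roughly 25 MB of raw samples per page before compression; halving the DPI cuts that to a quarter.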
Ivan