Text is missing when converting pdf file into image in java using pdfbox

Question

I want to convert a PDF page to an image file, but text is missing from the image when I convert the page using Java.

The file I want to convert is 46_2.pdf; after conversion, the result looks like 46_2.png.

Code:

import java.awt.image.BufferedImage;
import java.io.File;
import java.util.List;

import javax.imageio.ImageIO;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

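public class ConvertPDFPageToImageWithoutText {
    public static void main(String[] args) {
        try {
            String oldPath = "C:/PDFCopy/46_2.pdf";
            File oldFile = new File(oldPath);
            if (oldFile.exists()) {
                PDDocument document = PDDocument.load(oldPath);
                try {
                    // PDFBox 1.8.x returns a raw List here, hence the unchecked cast.
                    @SuppressWarnings("unchecked")
                    List<PDPage> list = document.getDocumentCatalog().getAllPages();

                    int pageNumber = 1;
                    for (PDPage page : list) {
                        BufferedImage image = page.convertToImage();
                        // One file per page, so later pages do not overwrite earlier ones.
                        File outputfile = new File("C:/PDFCopy/image_" + pageNumber++ + ".png");
                        ImageIO.write(image, "png", outputfile);
                    }
                } finally {
                    // Close once, after all pages are rendered (not inside the loop).
                    document.close();
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}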

I'd try using the convertToImage( type, resolution ) method and see what you get. I bet you're going to have to tinker with the resolution a few times to get it right. http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/pdmodel/PDPage.html#convertToImage(int, int) — Robert Beltran, Jan 11 '14 at 06:44
The 1.8.x versions have deficiencies with font rendering. These have been solved in the unreleased 2.0 version, which you can get with svn from the repository and then build with Maven. — Tilman Hausherr, Nov 08 '14 at 23:50
https://pdfbox.apache.org/downloads.html#scm Note that the API is different (especially rendering), so look at the examples to see how it is done. — Tilman Hausherr, Nov 11 '14 at 09:51
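
As a rough guide for the resolution tinkering suggested in the first comment: PDF page sizes are expressed in points (1/72 inch), so the rendered image grows with DPI / 72. The sketch below only does that arithmetic. The class and method names are hypothetical and are not part of PDFBox, and the exact rounding PDFBox applies may differ by a pixel.

```java
public class RenderSizeEstimate {
    /** Points per inch in PDF default user space. */
    static final double POINTS_PER_INCH = 72.0;

    /** Estimated pixel length of a page side given in points, rendered at the given DPI. */
    static int pixelsFor(double points, int dpi) {
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        // US Letter is 612 x 792 points (8.5 x 11 inches); used here only as sample input.
        double widthPt = 612;
        double heightPt = 792;
        for (int dpi : new int[] {72, 96, 200, 300}) {
            System.out.println(dpi + " DPI -> "
                    + pixelsFor(widthPt, dpi) + " x " + pixelsFor(heightPt, dpi) + " px");
        }
    }
}
```

A 612 x 792 point page at 200 DPI comes out at about 1700 x 2200 pixels, which helps when judging whether a chosen resolution is large enough for small text to stay legible. It does not address glyphs that are dropped entirely because of font handling.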

score 2 · Answer 1 · edited May 23 '17 at 12:24

Since you're using PDFBox, try PDFImageWriter.writeImage instead of PDPage.convertToImage. This post seems relevant to what you are trying to do.

answered Jan 11 '14 at 06:41

chairbender

Ah, too bad. According to [this link](http://mail-archives.apache.org/mod_mbox/pdfbox-users/201307.mbox/%3Cdef98071-3cb6-4fa2-9dd4-1ea2efcaa0ee@email.android.com%3E) ...it seems there are known issues with certain fonts. – chairbender Jan 11 '14 at 06:54
[PDFImageWriter.writeToImage](http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/util/PDFImageWriter.html#writeImage%28org.apache.pdfbox.pdmodel.PDDocument,%20java.lang.String,%20java.lang.String,%20int,%20int,%20java.lang.String%29) gives me same output. – UdayKiran Pulipati Jan 11 '14 at 07:37
I understand. I'm telling you that PDFBox [apparently has issues with some fonts](http://mail-archives.apache.org/mod_mbox/pdfbox-users/201307.mbox/%3Cdef98071-3cb6-4fa2-9dd4-1ea2efcaa0ee@email.android.com%3E), so I don't think you'll be able to get PDFBox to preserve that text until the developers fix PDFBox. – chairbender Jan 11 '14 at 07:45
`PDFImageWriter.writeImage()` uses `PDPage.convertToImage()` internally and just saves the resulting BufferedImage to the file system. – Nikita Bosik Dec 08 '15 at 13:45

score 1 · Answer 2 · answered Jan 13 '14 at 14:24

I had the same problem. I found an article (unfortunately I can't remember where, because I've read hundreds of them) whose author complained that such problems appeared in PDFBox after they updated Java to version 7.21. I'm using 7.17 and it works for me :)

score 0 · Answer 3 · answered May 04 '18 at 10:45

Use the latest version of PDFBox (I am using 2.0.9) and add the JAI Image I/O dependency from here. This is sample code that runs on Java 7.

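    import java.awt.image.BufferedImage;
    import java.io.File;

    import javax.imageio.ImageIO;

    import org.apache.commons.io.FilenameUtils;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.rendering.ImageType;
    import org.apache.pdfbox.rendering.PDFRenderer;

    // 'logger' is assumed to be a logging field of the enclosing class.
    public void pdfToImageConvertorUsingPdfBox(String inputPdfPath) throws Exception {
        File sourceFile = new File(inputPdfPath);
        String formatName = "png";
        if (sourceFile.exists()) {
            // try-with-resources closes the document even if rendering fails
            try (PDDocument document = PDDocument.load(sourceFile)) {
                PDFRenderer pdfRenderer = new PDFRenderer(document);
                int count = document.getNumberOfPages();

                for (int i = 0; i < count; i++) {
                    // Render page i (zero-based) at 200 DPI as an RGB image
                    BufferedImage image = pdfRenderer.renderImageWithDPI(i, 200, ImageType.RGB);
                    String output = FilenameUtils.removeExtension(inputPdfPath) + "_" + (i + 1) + "." + formatName;
                    ImageIO.write(image, formatName, new File(output));
                }
            }
        } else {
            logger.error(sourceFile.getName() + " does not exist");
        }
    }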
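
The sample above builds each output name with FilenameUtils.removeExtension from Apache Commons IO. If that library is not on the classpath, a similar name can be derived with plain String operations. This is a minimal sketch with hypothetical class and method names; it is not claimed to match FilenameUtils in every edge case.

```java
public class PageImageNames {
    /** Builds "<input path without extension>_<page number>.<format>" for a zero-based page index. */
    static String outputName(String inputPath, int pageIndex, String format) {
        int lastSep = Math.max(inputPath.lastIndexOf('/'), inputPath.lastIndexOf('\\'));
        int dot = inputPath.lastIndexOf('.');
        // Strip the extension only if the dot belongs to the file name, not to a directory.
        String base = (dot > lastSep) ? inputPath.substring(0, dot) : inputPath;
        return base + "_" + (pageIndex + 1) + "." + format;
    }

    public static void main(String[] args) {
        // Sample path taken from the question; page index 0 is the first page.
        System.out.println(outputName("C:/PDFCopy/46_2.pdf", 0, "png"));
    }
}
```

Inside the rendering loop this would replace the FilenameUtils call, for example `outputName(inputPdfPath, i, formatName)`.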
