How can I convert a PDF file to a Word file using Java?

And is it as easy as it looks?

Deduplicator
Gentuzos

2 Answers

Try Apache PDFBox. The example below uses the PDFBox 1.8.x API and extracts only the plain text of the PDF:

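import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper; // package name in PDFBox 1.8.x

public class PDFTextReader
{
    // Placeholder: replace with the path to your input PDF
    private static final String PDF_FILE_PATH = "/sample/filename.pdf";

    // Returns the text content of the PDF, or null if it could not be read
    static String pdftoText(String fileName) {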
        PDFParser parser;
        String parsedText = null;
        PDFTextStripper pdfStripper = null;
        PDDocument pdDoc = null;
        COSDocument cosDoc = null;
        File file = new File(fileName);
        if (!file.isFile()) {
            System.err.println("File " + fileName + " does not exist.");
            return null;
        }
        try {
            parser = new PDFParser(new FileInputStream(file));
        } catch (IOException e) {
            System.err.println("Unable to open PDF Parser. " + e.getMessage());
            return null;
        }
        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            System.err.println("An exception occurred while parsing the PDF document: "
                    + e.getMessage());
        } finally {
            try {
                if (cosDoc != null)
                    cosDoc.close();
                if (pdDoc != null)
                    pdDoc.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return parsedText;
    }
    public static void main(String[] args) {
        String content = pdftoText(PDF_FILE_PATH);
        if (content == null) {
            // pdftoText has already reported the error
            return;
        }

        File file = new File("/sample/filename.txt");

        // FileWriter creates the file if it doesn't exist;
        // try-with-resources closes the writer even if write() fails
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(file))) {
            bw.write(content);
            System.out.println("Done");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
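
The code above writes the extracted text to a .txt file. If the goal is a file that Word can open, one option is to wrap the same text in a minimal .docx package, which is a ZIP archive containing a few XML parts. The following is only a rough sketch under that assumption: the class name `MinimalDocxWriter` and its methods are hypothetical, it uses nothing but the JDK, and it produces unformatted paragraphs (no images, fonts, or italics).

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: writes plain text as a minimal .docx (one paragraph per line).
public class MinimalDocxWriter {

    private static final String XML_HEADER =
            "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>";

    private static final String CONTENT_TYPES = XML_HEADER
            + "<Types xmlns=\"http://schemas.openxmlformats.org/package/2006/content-types\">"
            + "<Default Extension=\"rels\" ContentType=\"application/vnd.openxmlformats-package.relationships+xml\"/>"
            + "<Default Extension=\"xml\" ContentType=\"application/xml\"/>"
            + "<Override PartName=\"/word/document.xml\" ContentType=\"application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml\"/>"
            + "</Types>";

    private static final String RELS = XML_HEADER
            + "<Relationships xmlns=\"http://schemas.openxmlformats.org/package/2006/relationships\">"
            + "<Relationship Id=\"rId1\" Type=\"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument\" Target=\"word/document.xml\"/>"
            + "</Relationships>";

    // Escapes XML special characters and drops ASCII control characters other than tab.
    public static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '&') sb.append("&amp;");
            else if (c == '<') sb.append("&lt;");
            else if (c == '>') sb.append("&gt;");
            else if (c >= 0x20 || c == '\t') sb.append(c);
        }
        return sb.toString();
    }

    // Builds word/document.xml with one <w:p> paragraph per input line.
    public static String buildDocumentXml(String text) {
        StringBuilder sb = new StringBuilder(XML_HEADER);
        sb.append("<w:document xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\"><w:body>");
        for (String line : text.split("\\r?\\n", -1)) {
            sb.append("<w:p><w:r><w:t xml:space=\"preserve\">")
              .append(escapeXml(line))
              .append("</w:t></w:r></w:p>");
        }
        sb.append("</w:body></w:document>");
        return sb.toString();
    }

    // Writes the three parts of the package into a ZIP stream.
    public static void writeDocx(String text, OutputStream out) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            addEntry(zip, "[Content_Types].xml", CONTENT_TYPES);
            addEntry(zip, "_rels/.rels", RELS);
            addEntry(zip, "word/document.xml", buildDocumentXml(text));
        }
    }

    private static void addEntry(ZipOutputStream zip, String name, String content)
            throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        zip.write(content.getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: java MinimalDocxWriter <output.docx>");
            return;
        }
        try (OutputStream out = new FileOutputStream(args[0])) {
            writeDocx("First line\nSecond line with <special> & characters", out);
        }
    }
}
```

In `PDFTextReader.main`, the string returned by `pdftoText` could be passed to `writeDocx` instead of being written with a `FileWriter`. This only carries over the text; layout, images, and formulas from the PDF are lost.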
newuser
  • Download the jar: http://mirror.nexcess.net/apache/pdfbox/1.8.2/pdfbox-1.8.2.jar – newuser Aug 01 '13 at 06:31
  • Thank you very much, but what about images and mathematical characters? Will I need to convert this to a Word file directly? – Gentuzos Aug 01 '13 at 06:37
  • Does your PDF have images and mathematical characters? – newuser Aug 01 '13 at 06:38
  • Yes, but converting it to a text file can't resolve this issue. – Gentuzos Aug 01 '13 at 06:42
  • Because it only reads text from the PDF. Reading text from images is only possible with OCR. – newuser Aug 01 '13 at 06:45
  • Use this to read text from images: http://code.google.com/p/tesseract-ocr/ – newuser Aug 01 '13 at 06:57
  • The problem is that I'm trying to write a console application for fun to convert my PDF file, which contains images, special characters, italics, etc. Do you know another library to achieve this? Otherwise, thank you very much for the code and the library you have provided; I think they will help me later ;) – Gentuzos Aug 01 '13 at 07:08
  • That can only be done with OCR. – newuser Aug 01 '13 at 07:12
  • Can I integrate it into the code? – Gentuzos Aug 01 '13 at 07:14
  • Take a look: http://stackoverflow.com/questions/1442608/read-text-from-image-file-in-java – newuser Aug 01 '13 at 07:17
  • I can't find any examples of it :/ – Gentuzos Aug 01 '13 at 09:21
  • If you need an exact answer, simply edit the question. You asked how to read the PDF, so everyone thinks of it that way; they don't know what you want. – newuser Aug 01 '13 at 10:21
  • Oh, because OCR is difficult to set up and takes a long time to extract the image content. I use the JPedal jar: http://www.idrsolutions.com/demo-landing-page/ Simply run the jar from your console. – newuser Aug 01 '13 at 10:24
  • You're right, but can you please help me find some examples of this extraction? I found [this](http://files.idrsolutions.com/samplecode/org/jpedal/examples/images/ExtractImages.java.html) but it doesn't work for me; Eclipse can't find the package **com.sun.media.jai** – Gentuzos Aug 01 '13 at 14:58
  • It's a third-party jar. You can download the com.sun.media jar: http://www.java2s.com/Code/Jar/s/Downloadsunjaicodecjar.htm – newuser Aug 02 '13 at 01:21
  • You are always welcome. – newuser Aug 02 '13 at 02:34
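
On the question of integrating OCR into the code: one way to sketch it is to call the Tesseract command-line tool from Java with `ProcessBuilder`. This assumes the `tesseract` binary is installed and on the PATH, and that the image has already been extracted from the PDF to a file. The class `TesseractRunner` and its method names are hypothetical; only the command form `tesseract <image> <outputBase> -l <language>` comes from Tesseract itself, which writes its result to `<outputBase>.txt`.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the tesseract command-line tool.
public class TesseractRunner {

    // Builds the command line: tesseract <image> <outputBase> -l <language>
    public static List<String> buildCommand(String imagePath, String outputBase, String language) {
        return Arrays.asList("tesseract", imagePath, outputBase, "-l", language);
    }

    // Runs Tesseract and returns the text file it is expected to produce (<outputBase>.txt).
    public static File runOcr(String imagePath, String outputBase, String language)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(imagePath, outputBase, language))
                .inheritIO() // show Tesseract's own messages on the console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return new File(outputBase + ".txt");
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: java TesseractRunner <image> <outputBase>");
            return;
        }
        File result = runOcr(args[0], args[1], "eng");
        System.out.println("OCR text written to " + result.getAbsolutePath());
    }
}
```

The recognized text could then be read from the returned file and combined with the text that PDFBox extracts. OCR quality on mathematical characters is likely to be limited.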

I have looked deeply into this matter and found that, for proper results, you cannot avoid using MS Word. Even funded projects such as LibreOffice struggle with proper conversion, as the Word format is rather complex and changes between versions. Only MS Word keeps track of this.

For this reason, I implemented documents4j, which delegates conversions to MS Word through a Java API. It also lets you move the conversions to a different machine, which you can contact through a REST API. You can find detailed information on its GitHub page.

Rafael Winterhalter
  • `The type com.documents4j.job.AbstractConverterBuilder cannot be resolved. It is indirectly referenced from required .class files` and that type does not exist in the javadoc reference – 0x777 Sep 11 '16 at 23:50
  • Seems like your class path is incomplete. The javadoc only contains the official API classes. – Rafael Winterhalter Sep 12 '16 at 07:04