Convert PDF to Word in Java

Question

Is it possible to convert PDF to Word in Java? I'm not talking about parsing a PDF document and then custom-rendering it to Word again. I want a Java library that can convert it directly.

Like everyone else, I don't think you're going to have much luck. If you have control of the system generating the PDFs then you can have it generate multiple formats at generation time (you haven't said where the PDFs come from). Is that an option? — Paul Jowett, Nov 08 '10 at 06:38
I hear you about wanting to do it "directly", but in the absence of a single (open-source?) library, you could try extraction with http://pdfbox.apache.org/ and create the docx with docx4j. YMMV: Google pdfbox "Paragraph boundary segmentation" — JasonPlutext, Nov 17 '10 at 03:26

score 4 · Answer 1 · answered Nov 03 '10 at 18:12

Reading PDF documents is a very involved process, and there are no good free libraries for extracting non-text information from PDF documents in Java. Worse yet, PDF documents have a lot of layout information that is hard to reconstruct; for example, a table in a Word document becomes some lines and a bunch of pieces of text in PDF.

Michael Shopsin

"a lot of layout information that is hard to reconstruct" is misleading. There IS NO LAYOUT INFORMATION. Everything in a PDF is absolutely positioned. There's no such thing as a table, it's just lines, characters ("glyphs" really), and maybe some bitmaps. Heck, "text" can just be lines too. None too efficient, but perfectly "legal". – Mark Storer Nov 03 '10 at 18:25
EXCEPTIONS to my comment: There's this stuff called "marked content" that is optional within PDFs. When it's there, it can mark up tables, paragraphs, etc. But there's no standard way to write out the kind of detail you'd need for a reliable PDF->N format conversion. PDF is all but a write-only format. – Mark Storer Nov 03 '10 at 18:26
Having tried to do some PDF reconstruction, I can say there is some visual layout information in PDF but no real structured information. I agree with you, Mark, that PDF conversion to anything non-image is very hard. – Michael Shopsin Nov 04 '10 at 13:43

score 2 · Answer 2 · answered Nov 03 '10 at 18:31

It is almost impossible to recreate semantic information from an arbitrary PDF. If you have the same tool that wrote it, you have a somewhat better chance, but even so there is much uncertainty. The only thing you can be sure of in a (text) PDF is the position of each character on the page. (Note that some PDFs include bitmaps that contain text, and recovering that text has to rely on OCR.)
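To illustrate what "only the position of each character" means in practice, here is a minimal sketch in plain Java. It assumes you already have a list of positioned characters from some PDF parser; the `Glyph` record, the `toLines` method, and both thresholds are hypothetical names and values made up for this illustration, not part of any library. It only guesses lines and word gaps from coordinates, which is roughly the best a converter can do before it even starts on paragraphs or tables.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class GlyphLines {
    /** One positioned character (hypothetical type, not from a real PDF library). */
    record Glyph(char ch, double x, double y, double width) {}

    /**
     * Heuristic: glyphs whose baselines lie within yTolerance of each other are
     * treated as one line; a horizontal gap wider than gapThreshold becomes a space.
     */
    static List<String> toLines(List<Glyph> glyphs, double yTolerance, double gapThreshold) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        // Default PDF user space has y growing upward, so sort top of page first.
        sorted.sort((a, b) -> Double.compare(b.y(), a.y()));

        // Bucket glyphs into rows by comparing against the first glyph of the current row.
        List<List<Glyph>> rows = new ArrayList<>();
        for (Glyph g : sorted) {
            List<Glyph> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(last.get(0).y() - g.y()) <= yTolerance) {
                last.add(g);
            } else {
                List<Glyph> row = new ArrayList<>();
                row.add(g);
                rows.add(row);
            }
        }

        // Within each row, read left to right and insert spaces at large gaps.
        List<String> lines = new ArrayList<>();
        for (List<Glyph> row : rows) {
            row.sort(Comparator.comparingDouble(g -> g.x()));
            StringBuilder sb = new StringBuilder();
            Glyph prev = null;
            for (Glyph g : row) {
                if (prev != null && g.x() - (prev.x() + prev.width()) > gapThreshold) {
                    sb.append(' ');
                }
                sb.append(g.ch());
                prev = g;
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Glyphs arrive in arbitrary order, as they may in a PDF content stream.
        List<Glyph> page = List.of(
                new Glyph('o', 36, 700.0, 6), new Glyph('H', 10, 700.0, 6),
                new Glyph('k', 16, 680.0, 6), new Glyph('i', 16, 700.3, 3),
                new Glyph('y', 30, 700.0, 6), new Glyph('o', 10, 680.0, 6));
        toLines(page, 1.0, 2.0).forEach(System.out::println);
    }
}
```

Even this toy version depends on two tuned thresholds, and it has no notion of columns, tables, or paragraphs, which is where the real uncertainty comes from.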

There are several groups in computer science departments and elsewhere who are spending very significant effort trying to extract semantic information. We collaborate with Penn State, one of the leaders, and they are working on extracting tables. In good cases they get 90%; in bad ones, 50%.

So the answer is formally that you cannot, but you may occasionally be fortunate. (We do a lot of this for chemistry and count ourselves lucky if we get 50% on a regular basis).

peter.murray.rust

I've never understood this...just keep the original documents. @.@ I'm sure there are probably times when it's necessary, but really, the entire point of a PDF is a finalized, non-editable document. – Kevin Coppock Nov 03 '10 at 18:40
@kcoppock. This is when you need something from someone else's document. For example, I want data from the scientific literature. Although the publishers have the XML, they generally refuse to make it available, so we have to try to extract it from the PDF. In many cases people have to retype stuff or redraw graphs. – peter.murray.rust Nov 03 '10 at 19:05
I can understand that, but at the same time, typically that means that the publisher doesn't want you using the content. – Kevin Coppock Nov 03 '10 at 19:15
@kcoppock. I am well known in science for challenging this view, but Stack Overflow isn't the best place to discuss it! There are, however, many cases where it is legitimate to do this. – peter.murray.rust Nov 03 '10 at 19:31

score 0 · Answer 3 · answered Nov 07 '12 at 08:40

You can try to do it with the iText library: read the PDF and then write it out as RTF.
This is not that simple, though, as you have to preserve the different styles that the PDF has.
You can use an external tool.
Install a free program like "Free PDF to Doc" and execute it from your Java program.
This works fine in most cases.
You can also use the Acrobat Pro SDK from your Java code.
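For the external-tool route, a rough sketch of launching a converter from Java with `ProcessBuilder` could look like the following. The executable name `pdf2doc-cli` and its two-argument calling convention are placeholders I made up; check what command line (if any) your installed converter actually offers and adjust `buildCommand` to match.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

final class ExternalConverter {
    // Hypothetical executable name; replace with the converter you actually installed.
    static final String TOOL = "pdf2doc-cli";

    /** Builds the command line; the "<tool> <input> <output>" shape is an assumption. */
    static List<String> buildCommand(Path pdf, Path doc) {
        return List.of(TOOL, pdf.toString(), doc.toString());
    }

    /** Runs the converter, waits for it to finish, and returns its exit code. */
    static int convert(Path pdf, Path doc) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf, doc))
                .inheritIO() // show the tool's own output and errors on our console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: ExternalConverter <input.pdf> <output.doc>");
            return;
        }
        int exit = convert(Path.of(args[0]), Path.of(args[1]));
        System.out.println("converter exited with code " + exit);
    }
}
```

Passing the arguments as a list rather than one concatenated string avoids quoting problems with paths that contain spaces. A non-zero exit code or a missing output file should be treated as a failed conversion.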

Best of luck
