Can we extract text from an MS Word document using Apache POI so that it reads like the original document?

So far, the examples I have found extract paragraphs, tables and images separately.

But I need to extract the content as it is: in the same order as in the document, with paragraphs, tables and images mixed together.

Is it possible?

Arunkumar S
  • Can you explain what you mean by "like original document"? Do you want to extract the document contents as plain text without formatting? Or do you expect the text to be formatted in some way? If so, what format do you expect it to be in? – Dirk Vollmar Aug 19 '16 at 08:45
  • Plain text, with or without formatting, but tables, images and everything else should stay in the flow, as in the original file. – Arunkumar S Aug 22 '16 at 08:55
  • There are several formats that can include tables and images, e.g. RTF or HTML. What do you want to do with the extracted content? If you want to use HTML you should have a look at this [answer](http://stackoverflow.com/a/7901139/40347). – Dirk Vollmar Aug 22 '16 at 09:05
  • Great suggestion @Dirk, thank you. I want it in HTML only. Your suggestion saved me. I tried the provided sample, but it also skips images, tables and content inside diagrams (such as boxes). Is there any way to extract those too? I have seen a few examples that extract images separately, but can we place those diagrams exactly where they are in the original file? – Arunkumar S Aug 22 '16 at 09:29
  • I doubt there is an easy solution for this. Things such as freely positioned frames and shapes that are anchored to paragraphs in Word have no direct equivalent in HTML. You might try to automate Word and use Word's own *Save as HTML* functionality, or check out a solution based on LibreOffice or the commercial Aspose library. – Dirk Vollmar Aug 22 '16 at 09:37
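
For the simpler half of the question (plain text with paragraphs, tables and image positions kept in document order), a minimal sketch may help. It is not Apache POI: it reads the `.docx` package directly with the Python standard library, assuming the usual `word/document.xml` part. The function names `extract_in_order` and `paragraph_text` are made up for illustration. In POI, `XWPFDocument.getBodyElements()` exposes the same ordering of paragraphs and tables. The sketch makes no attempt to reproduce the layout of floating shapes, and text inside text boxes may appear twice when the file stores a fallback copy.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by WordprocessingML, DrawingML and package relationships.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def q(ns, name):
    """Build the '{namespace}name' form that ElementTree uses for tags."""
    return "{" + ns + "}" + name


def paragraph_text(p):
    """Join the text runs of one w:p and mark embedded pictures inline."""
    parts = []
    for node in p.iter():
        if node.tag == q(W, "t"):
            parts.append(node.text or "")
        elif node.tag == q(A, "blip"):
            # r:embed is the relationship id of the picture part.
            parts.append("[image " + str(node.get(q(R, "embed"))) + "]")
    return "".join(parts)


def extract_in_order(docx_file):
    """Return body content as a list of lines, in document order."""
    with zipfile.ZipFile(docx_file) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    body = root.find(q(W, "body"))
    lines = []
    for child in body:
        if child.tag == q(W, "p"):
            lines.append(paragraph_text(child))
        elif child.tag == q(W, "tbl"):
            # One output line per row; nested tables are flattened (a simplification).
            for row in child.iter(q(W, "tr")):
                cells = [
                    " ".join(paragraph_text(p) for p in cell.iter(q(W, "p")))
                    for cell in row.iter(q(W, "tc"))
                ]
                lines.append(" | ".join(cells))
    return lines
```

Calling `extract_in_order("report.docx")` (a hypothetical file name) returns one string per paragraph or table row, with a marker such as `[image rId5]` wherever a picture is anchored, so the flow of the original is kept even though the visual layout is not.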

0 Answers