How to extract just plain text from .doc & .docx files?

Question

Does anyone know of a tool they can recommend for extracting just the plain text from a .doc or .docx file?

I've found this - wondered if there were any other suggestions?

This is a perfect fit for Software Recommendations. It should be transferred there. — demongolem, Jun 26 '15 at 15:20
If we have `Software Recommendations`, why not transfer it there? I also looked there for software for similar tasks and did not find a good answer. I can recommend `pandoc` as the best solution; it even converts tables correctly. So I suggest reopening the question. – Hubbitus, Jan 14 '19 at 21:22
You obviously aren't on a Mac, but if you were you could use 'textutil' at the command line to quickly get plain text from various proprietary document types. — dave, Jan 28 '19 at 18:35
This question is being [discussed on Meta](https://meta.stackoverflow.com/questions/383134/should-how-to-extract-just-plain-text-from-doc-docx-files-be-migrated) — TylerH, Apr 25 '19 at 14:42
@Taryn: care to explain why this Q is off-topic but https://stackoverflow.com/questions/8252220/how-to-extract-plain-text-from-ms-word-document-file-in-pure-c is not? — slashmais, Jan 21 '20 at 06:28

score 79 · Answer 1 · answered Sep 02 '14 at 09:46

If you want the pure plain text (my requirement), then all you need is:

unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

I found this at commandlinefu.

It unzips the .docx file, reads the actual document (word/document.xml), and strips all the XML tags. Obviously, all formatting is lost.
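
For anyone who wants the same approach without shelling out, here is a minimal Java sketch that uses only the standard library (Java 9 or later). It reads word/document.xml from the archive, turns each closing paragraph tag into a newline, and strips the remaining tags. The names DocxPlainText, stripTags and extract are made up for illustration. The regex-based stripping is as rough as the sed version, it assumes the XML is UTF-8 encoded, and it decodes only the five predefined XML entities.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: a Java take on "unzip -p ... | sed ...".
final class DocxPlainText {

    // Turns the raw XML of word/document.xml into plain text.
    public static String stripTags(String xml) {
        return xml
                .replaceAll("</w:p>", "\n")   // end of paragraph becomes a newline
                .replaceAll("<[^>]+>", "")    // drop every remaining tag
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&");       // decode &amp; last to avoid double decoding
    }

    // Reads word/document.xml out of the .docx (a ZIP archive) and strips it.
    public static String extract(String docxPath) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docxPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                // Assumption: the part is UTF-8 encoded.
                return stripTags(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: java DocxPlainText <file.docx>");
            return;
        }
        System.out.print(extract(args[0]));
    }
}
```

Like the shell one-liner, this only looks at the main document part, so headers, footers and footnotes stored in other parts of the archive are ignored.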

answered Sep 02 '14 at 09:46

rob

I like this command, but often newlines are still useful data to have in the final version. Therefore I used the following command instead: `unzip -p document.docx word/document.xml | sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'` Note the additional sed argument, replacing XML representations of newlines with the actual newline character, and I edited the last sed argument to not strip newline characters. This makes the above command far more useful for diff-ing Word documents. – Jeff McJunkin Nov 11 '15 at 00:50
Thanks Rob! @Jeff: I agree but the following command works better for me in practice: unzip -p document.docx word/document.xml | sed -e 's/<\/w:p>/ /g; s/<[^>]\{1,\}>/ /g; s/[^[:print:]]\{1,\}/ /g' – Tom G Jan 21 '17 at 20:40
Very nice. Is it also possible to edit the XML data inside the Word document without corrupting it? And how? – anaotha Mar 01 '17 at 13:37
How does this fare with non-ASCII characters? Especially the more esoteric character sets? – einpoklum Aug 11 '17 at 21:42
@einpoklum the first bit of the command gets the raw XML, so that will work fine. The second bit takes all the non-XML-tag strings and separates them with a new line. So as long as sed does not barf on esoteric character sets, you should be fine. Please post a reply if you find that is not the case. – rob Aug 15 '17 at 15:32
Good idea for recovering corrupted files. – Gathide Apr 13 '18 at 06:01
this doesn't preserve newline – mending3 Feb 03 '21 at 10:42
I agree with Jeff; besides, rob's suggestion does lose some text (text close to recurring starts of lines seems to be suppressed). – louigi600 Sep 20 '21 at 16:44
@jeff-mcjunkin nice snippet, thanks a lot! I think you should add `s//\n/g;` too ;) – kazuser Nov 23 '21 at 10:19

ccpizza · Answer 2 · 2019-04-21T21:51:57.650

LibreOffice

One option is libreoffice/openoffice in headless mode (make sure all other instances of libreoffice are closed first):

libreoffice --headless --convert-to "txt:Text (encoded):UTF8" mydocument.doc

For more details see e.g. this link: http://ask.libreoffice.org/en/question/2641/convert-to-command-line-parameter/

For a list of libreoffice filters see http://cgit.freedesktop.org/libreoffice/core/tree/filter/source/config/fragments/filters

Since the openoffice command line syntax is a bit too complicated, there is a handy wrapper which can make the process easier: unoconv.
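
If unoconv is not available, the headless call can also be scripted directly. The following is a rough Java sketch (Java 9 or later) that builds the same command line as above and runs it with ProcessBuilder. The class and method names are invented for this example, and passing --outdir to choose the output folder is my addition to the command shown above. Depending on the installation, the binary may be called soffice rather than libreoffice.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical wrapper around the headless LibreOffice conversion.
final class HeadlessConvert {

    // Builds the argument list for a headless conversion to UTF-8 text.
    public static List<String> buildCommand(String binary, String input, String outDir) {
        return List.of(
                binary,
                "--headless",
                "--convert-to", "txt:Text (encoded):UTF8",
                "--outdir", outDir,   // assumption: write the .txt into this folder
                input);
    }

    // Runs the conversion, forwarding LibreOffice's output, and returns its exit code.
    public static int convert(String binary, String input, String outDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(binary, input, outDir))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: java HeadlessConvert <input_file> <output_dir>");
            return;
        }
        System.exit(convert("libreoffice", args[0], args[1]));
    }
}
```

Passing the arguments as a list avoids shell quoting problems with the spaces and parentheses in the filter name.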

Apache POI

Another option is Apache POI, a well-supported Java library which, unlike antiword, can read, create and convert .doc, .docx, .xls, .xlsx, .ppt and .pptx files.

Here is the simplest possible Java code for converting a .doc or .docx document to plain text:

import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;

import org.apache.poi.POITextExtractor;
import org.apache.poi.extractor.ExtractorFactory;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.xmlbeans.XmlException;

public class WordToTextConverter {
    public static void main(String[] args) {
        // Both the input and the output path are required.
        if (args.length < 2) {
            System.out.println("Usage: java WordToTextConverter <word_file> <text_file>");
            return;
        }
        convertWordToText(args[0], args[1]);
    }

    public static void convertWordToText(String src, String dest) {
        // try-with-resources closes both files even if the extraction fails.
        try (FileInputStream fs = new FileInputStream(src);
             FileWriter fw = new FileWriter(dest)) {
            // The factory picks the .doc or .docx extractor based on the stream content.
            final POITextExtractor extractor = ExtractorFactory.createExtractor(fs);
            fw.write(extractor.getText());
        } catch (IOException | OpenXML4JException | XmlException e) {
            e.printStackTrace();
        }
    }
}


Maven dependencies (pom.xml):

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>my.wordconv</groupId>
<artifactId>my.wordconv.converter</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>3.17</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>3.17</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-scratchpad</artifactId>
        <version>3.17</version>
    </dependency>
</dependencies>
</project>

NOTE: You will need to add the Apache POI libraries to the classpath. On Ubuntu/Debian the libraries can be installed with sudo apt-get install libapache-poi-java, which puts them under /usr/share/java. On other systems, download the library and unpack the archive to a folder of your choice, then use that folder instead of /usr/share/java. If you use Maven or Gradle (the recommended option), include the org.apache.poi dependencies as shown in the pom.xml snippet.

The same code will work for both .doc and .docx as the required converter implementation will be chosen by inspecting the binary stream.
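
To give an idea of what "inspecting the binary stream" can mean, here is a small standalone sketch (Java 9 or later). It is not POI's actual implementation, only an illustration: legacy .doc files are OLE2 compound documents and .docx files are ZIP archives, so the two can be told apart by their leading bytes. The class, enum and method names are invented for this example.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical sniffer that distinguishes the two container formats.
final class WordFormatSniffer {

    enum Format { OLE2, ZIP, UNKNOWN }

    // OLE2 compound document signature (container of binary .doc files).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1 };

    // ZIP local file header signature "PK\3\4" (container of .docx files).
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 3, 4 };

    // Classifies a file by its first bytes.
    public static Format sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return Format.OLE2;
        if (startsWith(head, ZIP_MAGIC)) return Format.ZIP;
        return Format.UNKNOWN;
    }

    // Reads up to eight bytes from the stream and classifies them.
    public static Format sniff(InputStream in) throws IOException {
        byte[] head = new byte[OLE2_MAGIC.length];
        int n = in.readNBytes(head, 0, head.length);
        return sniff(Arrays.copyOf(head, n));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: java WordFormatSniffer <file>");
            return;
        }
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            System.out.println(sniff(in));
        }
    }
}
```

Note that any ZIP archive or OLE2 file matches, so this check identifies the container only; it does not prove that the file is a Word document.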

Compile the WordToTextConverter class (assuming it's in the default package, and the Apache POI jars are under /usr/share/java). The classpath is quoted so that the shell passes the * wildcard through to Java unchanged:

javac -cp "/usr/share/java/*:." WordToTextConverter.java

Run the conversion:

java -cp "/usr/share/java/*:." WordToTextConverter doc.docx doc.txt

There is also a clonable Gradle project which pulls all necessary dependencies and generates the wrapper shell script (with gradle installDist).

If you're going to add Java options into the mix, I'd like to mention 'my' docx4j (which also handles pptx, xlsx). For text extraction, you'd use https://github.com/plutext/docx4j/blob/master/src/main/java/org/docx4j/TextUtils.java — JasonPlutext, Mar 07 '13 at 02:43
See also question 1686 on Ask LibreOffice about running the command line conversion in parallel with a running LibreOffice instance: http://ask.libreoffice.org/en/question/1686/how-to-not-connect-to-a-running-instance/ — Mihai Capotă, Mar 25 '15 at 15:03
When I tried using libreoffice to convert some docx files, I got this weird error `Error: Please reverify input parameters...`, which disappeared when I switched to `--convert-to "txt:Text (encoded):UTF8"`, so I'd recommend that (even if you don't have non-ASCII characters). – yoniLavi, Oct 07 '16 at 12:06
This is the ideal approach IMO. But to get it working in OSX, I had to uninstall the GUI-installed version, then run `brew install libreoffice`. Then, the command that worked was `soffice --headless ...` instead of `libreoffice --headless ...`. Although this question is closed, it's *the very first* google result, so it might be worth adding this to the answer to help us hapless searchers. — senderle, Apr 22 '19 at 14:21
@senderle: no need to uninstall the existing GUI-installed version — in that scenario the binary is just not available in $PATH; you can still call it on macos e.g. with `/Applications/LibreOffice.app/Contents/MacOS/soffice --headless --help` — ccpizza, Apr 22 '19 at 14:28
@ccpizza true, but I like to cede as much control of my path to package managers as possible. With homebrew, it Just Works. — senderle, Apr 22 '19 at 14:30
@senderle: fair enough; `brew cask info libreoffice` points to the formula at https://github.com/Homebrew/homebrew-cask/blob/master/Casks/libreoffice.rb where you can see it additionally puts a wrapper script under `/usr/local/bin/soffice`. It's useful to know what exactly is going on just in case the formula gets removed, or in case you need a newer version than the one provided by brew. — ccpizza, Apr 22 '19 at 14:35

score 16 · Answer 3 · edited Jan 11 '21 at 19:03

Try Apache Tika. It supports most document formats (every MS Office format, OpenOffice/LibreOffice formats, PDF, etc.) using Java-based libraries (among others, Apache POI). It's very simple to use:

java -jar tika-app-1.4.jar --text ./my-document.doc

edited Jan 11 '21 at 19:03

Matthias Braun

answered Jan 02 '14 at 14:45

molnarg

score 10 · Answer 4 · edited Nov 16 '22 at 09:08

Try "antiword" or "antiword-xp-rb"

My favorite is antiword:

http://www.winfield.demon.nl/

And here's a similar project which claims support for docx:

https://github.com/rainey/antiword-xp-rb/wiki

edited Nov 16 '22 at 09:08

StackzOfZtuff

answered Apr 15 '11 at 03:14

Chris Eberle

I have used antiword (the first one above) many times, but it does not work with docx. From its page: "Antiword converts the binary files from Word 2, 6, 7, 97, 2000, 2002 and 2003 to plain text and to PostScript" – Arpad Horvath -- Слава Україні Jan 05 '18 at 09:51

Andre · Answer 5 · 2013-10-31T22:04:55.247

5

I find wv to be better than catdoc or antiword. It can deal with .docx and convert to text or html. Here is a function I added to my .bashrc to temporarily view the file in the terminal. Change it as required.

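# open a Word document in less (i.e. worl document.doc)
worl() {
    # keep the temp file name local and quote it, so paths with spaces work
    local doc
    doc=$(mktemp /tmp/output.XXXXXXXXXX)
    wvText "$1" "$doc"
    less "$doc"
    rm "$doc"
}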

edited Oct 31 '13 at 22:04

answered Oct 31 '13 at 11:29

Andre

For those on OSX, you can `brew install wv && brew install elinks`. – Sean Allred Nov 19 '13 at 23:57
Works a treat and supports .doc and .docx – Steve Childs Jan 12 '16 at 09:53

score 1 · Answer 6 · answered Jul 23 '14 at 16:22

I recently dealt with this issue and found the OpenOffice/LibreOffice command-line tools to be unreliable in production (thousands of docs processed, dozens concurrently).

Ultimately, I built a lightweight wrapper, DocRipper, that is much faster and grabs all text from .doc, .docx and .pdf files without formatting. DocRipper uses Antiword, grep and pdftotext to grab the text and return it.
