28

We're looking for a program that can convert a .doc or .docx document to a plain-text file. We're working on Linux and want to launch a website that converts user-uploaded Word files. We don't want to use OpenOffice/LibreOffice because we've had bad experiences with them. Pandoc can't handle .doc files.

Does anyone have an idea?

Kara
user698601

4 Answers

27

You will have to use two different command-line tools, depending on whether you are working with the .doc or the .docx format.

For .doc use catdoc:

catdoc foo.doc > foo.txt

For .docx use docx2txt:

docx2txt foo.docx

The latter will produce a file called foo.txt in the same directory as the original.

I'm not sure which Linux distribution you are using, but both catdoc and docx2txt are available from the Ubuntu repositories, for example:

apt-get install docx2txt

Or with Homebrew on Mac:

brew install docx2txt
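
For a site that accepts both formats, the two commands above could be wrapped in a small dispatcher that picks the tool by file extension. This is only a minimal sketch: `to_txt` is a hypothetical name, it assumes catdoc and docx2txt are installed and on the PATH, and it matches lowercase extensions only.

```shell
#!/bin/sh
# to_txt (hypothetical helper): convert foo.doc or foo.docx to foo.txt
# next to the original. Assumes catdoc and docx2txt are on the PATH.
to_txt() {
    case "$1" in
        *.docx)
            # docx2txt writes foo.txt beside foo.docx on its own
            docx2txt "$1"
            ;;
        *.doc)
            # catdoc prints to stdout, so redirect into foo.txt
            catdoc "$1" > "${1%.doc}.txt"
            ;;
        *)
            echo "unsupported file type: $1" >&2
            return 1
            ;;
    esac
}
```

Usage would be `to_txt upload.docx`. On installs where the script is named docx2txt.sh, that name would need to be swapped in.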
David Wolever
harlandski
  • Thanks for the info. Unfortunately, for me brew install docx2txt didn't work, the 'catdoc' command is not available, and I need to use 'docx2txt.sh' instead of 'docx2txt'. – Barney Szabolcs Nov 17 '19 at 11:04
  • It turns out catdoc was moved to the boneyard, but one can build it from source; details here: https://apple.stackexchange.com/a/294259/36790 – Barney Szabolcs Nov 17 '19 at 11:09
1

Here is a Perl project which claims to do it. I have also done a lot of this by hand, using XSLT on document.xml. A .docx file is just a zip archive, so you can unzip it and inspect the elements. I will say that this is not hard to do for specific files, but it is very hard to do in the general case, because of the lack of documentation for how Word stores things internally and the variance in the internal representation.
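
To illustrate the "unzip and inspect" route, here is a rough sketch that strips the markup from the main document part (word/document.xml) with awk. `docx_xml_to_text` is a hypothetical name. It assumes each tag sits on a single line, does not decode entities such as `&amp;`, and ignores tabs, tables, headers and footnotes, so it only suits simple files.

```shell
#!/bin/sh
# docx_xml_to_text (hypothetical helper): read document.xml on stdin and
# print its text, one paragraph per line.
docx_xml_to_text() {
    awk '{
        gsub(/<\/w:p>/, "\n")   # a closing paragraph tag becomes a newline
        gsub(/<[^>]*>/, "")     # drop every remaining tag
        print
    }'
}
```

It would be used as `unzip -p foo.docx word/document.xml | docx_xml_to_text > foo.txt`, assuming unzip is installed.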

Paul Sanwald
0

For .doc files you can use antiword; it is available through Homebrew and in the Ubuntu repositories.
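
Like catdoc, antiword prints the extracted text to standard output, so `antiword foo.doc > foo.txt` gives you a text file. If you cannot be sure which tool a server has, a fallback along these lines might help. It is a sketch only, and `doc_to_stdout` is a hypothetical name.

```shell
#!/bin/sh
# doc_to_stdout (hypothetical helper): print the text of a .doc file using
# whichever of antiword or catdoc is found first.
doc_to_stdout() {
    if command -v antiword >/dev/null 2>&1; then
        antiword "$1"
    elif command -v catdoc >/dev/null 2>&1; then
        catdoc "$1"
    else
        echo "neither antiword nor catdoc found" >&2
        return 127
    fi
}
```

Usage would be `doc_to_stdout foo.doc > foo.txt`.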

Mishari
0

You can also use pandoc for .docx files:

Hard-wrap the lines (pandoc's default), so the text looks roughly like the displayed document:

pandoc -s mydocument.docx -o output.txt

Break lines only where the original text has a line break:

pandoc --wrap=none -s mydocument.docx -o output.txt
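
If the result still contains Markdown-style markup (as far as I know, pandoc may pick a Markdown writer based on the .txt extension), naming the formats explicitly should help. A small sketch, where `docx_to_plain` is a hypothetical wrapper name and pandoc is assumed to be installed:

```shell
#!/bin/sh
# docx_to_plain (hypothetical helper): convert a .docx file to unwrapped
# plain text, stating the input and output formats explicitly.
docx_to_plain() {
    pandoc --from=docx --to=plain --wrap=none "$1" -o "$2"
}
```

Usage would be `docx_to_plain mydocument.docx output.txt`. This still does not cover .doc files.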
G M