
docx to txt:

I tried the following command to extract text from a docx file. It does not work when the docx contains images.

unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

For pptx to txt, I found a Perl script that extracts the text. It has the same problem: it does not work when the pptx contains images.

I want the extracted text so that I can enable search across documents. Even a command or script that just skips the images and converts the docx text content to txt would help!

RPS

1 Answer


The SO question "How to extract just plain text from .doc & .docx files?" provides other options.
The libreoffice answer there almost works, and probably did work in 2012.
With a current version (LibreOffice 5.1), try:

libreoffice --convert-to txt:Text some.docx

or

libreoffice --headless --convert-to txt:Text some.docx
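
Since the goal is search across many documents, the headless command can be wrapped in a loop. This is only a sketch: `convert_docx_dir` and its two directory arguments are hypothetical names, and it assumes the output filter is written as `txt:Text` and that `libreoffice` is on the PATH. `--outdir` is the LibreOffice option that sets where the converted files are written.

```shell
# Hypothetical helper: convert every .docx in one directory to .txt.
# Usage: convert_docx_dir SOURCE_DIR OUTPUT_DIR
convert_docx_dir() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for f in "$src_dir"/*.docx; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$f" ] || continue
        libreoffice --headless --convert-to txt:Text --outdir "$out_dir" "$f"
    done
}
```

For example, `convert_docx_dir ./docs ./txt` should leave one .txt file per .docx in ./txt, which a tool such as grep can then search.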

Make sure that LibreOffice is not already running.
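
If LibreOffice is not an option, a variant of the pipeline from the question may be enough for indexing. The sketch below is an assumption-laden alternative, not something tested against the asker's files: `docx_xml_to_text` is a hypothetical helper name, and it handles only paragraph ends and three basic XML entities. It reads only word/document.xml; images are stored as separate members of the docx archive, so their data is never read.

```shell
# Hypothetical helper: read WordprocessingML on stdin, write plain text.
docx_xml_to_text() {
    awk '{
        gsub(/<\/w:p>/, "\n")     # end of paragraph becomes a line break
        gsub(/<[^>]*>/, "")       # drop all remaining tags
        gsub(/&lt;/, "<")         # decode basic entities, &amp; last
        gsub(/&gt;/, ">")
        gsub(/&amp;/, "\\&")
        print
    }'
}
```

It can be used as `unzip -p some.docx word/document.xml | docx_xml_to_text > some.txt`. Text in headers, footers, and footnotes lives in other archive members and is not covered by this sketch.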

Ry-
Rolf of Saxony