Extract text content from docx and pptx that contains text & images - linux

Question

docx to txt:

I tried the following command to extract text from a docx file. It does not work when the docx contains images.

unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

For pptx to txt, I found a Perl script that extracts the text, but it has the same problem: it does not work when the pptx contains images.

I want the extracted text so that I can enable searching across documents. A command or script that simply skips the images and converts the text content of the docx to txt would already help!

I would prefer linux command but even perl/python script will do. — RPS, Jun 20 '17 at 07:18
If you have motivation, `strings some.docx` and sort it by hand but it will be long and painful — Pantoofle, Jun 20 '17 at 07:26
Why is sort needed? I want extracted text alone. Even skipping out the images is fine. — RPS, Jun 20 '17 at 07:28
You have a great solution in your question. I've been looking for the exact parsing you have to extract the text from docx files (albeit without images.) Bravo. — bballdave025, Apr 16 '19 at 16:08

score 4 · Answer 1 · edited Jul 22 '19 at 22:12

The SO question "How to extract just plain text from .doc & .docx files?" provides other options.
The LibreOffice answer there almost works, and probably did work in 2012.
With a current version (LibreOffice 5.1), try:

libreoffice --convert-to txt:Text some.docx

or

libreoffice --headless --convert-to txt:Text some.docx

Be sure that you do not have libreoffice already open.
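
If installing LibreOffice is not an option, the text can also be pulled straight out of the XML inside the archive, since the question allows a Python script. The following is only a minimal sketch using the standard library, not a full converter. It assumes the usual package layout (word/document.xml for docx, ppt/slides/slideN.xml for pptx); the function names docx_to_text, pptx_to_text and paragraphs_from_xml are made up for this example. Images are skipped naturally because they contain no text elements. Headers, footers, notes, tabs and line breaks are ignored, and text inside text boxes may show up twice.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs of WordprocessingML (docx) and DrawingML (pptx text).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def paragraphs_from_xml(xml_bytes, ns):
    """Return one string per non-empty paragraph (<w:p> or <a:p>)."""
    root = ET.fromstring(xml_bytes)
    lines = []
    for para in root.iter("{%s}p" % ns):
        # Visible text lives in <w:t> / <a:t>; drawings have none of these.
        text = "".join(t.text or "" for t in para.iter("{%s}t" % ns))
        if text:
            lines.append(text)
    return lines


def docx_to_text(path):
    """Extract the body text of a docx file, one paragraph per line."""
    with zipfile.ZipFile(path) as zf:
        return "\n".join(paragraphs_from_xml(zf.read("word/document.xml"), W_NS))


def pptx_to_text(path):
    """Extract the slide text of a pptx file, in slide-number order."""
    with zipfile.ZipFile(path) as zf:
        slides = [n for n in zf.namelist()
                  if re.fullmatch(r"ppt/slides/slide\d+\.xml", n)]
        # Sort numerically so that slide10 comes after slide9.
        slides.sort(key=lambda n: int(re.search(r"\d+", n).group()))
        lines = []
        for name in slides:
            lines.extend(paragraphs_from_xml(zf.read(name), A_NS))
        return "\n".join(lines)
```

Calling print(docx_to_text("some.docx")) then writes the text to standard output, which can be redirected to a .txt file for indexing.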

answered Jun 20 '17 at 07:40 · Rolf of Saxony (edited by Ry-)
