command line to convert all .docx in a directory (and subdirectories) to text file and write new files

Question

I would like to convert all .docx files in a directory (and its subdirectories) to text files from the command line, so that I can run grep on them afterwards. I found this:

unzip -p tutu.docx word/document.xml | sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'

here, which works well, but it prints the text to the terminal. I would like to write each new text file (.txt, for instance) to the same directory as its .docx file, and I would like a script that does this recursively.

I have this, using antiword, that does what I want for .doc files, but it doesn't work for .docx files.

find . -name '*.doc' | while IFS= read -r i; do antiword -i 1 "${i}" >"${i%.doc}.txt"; done

I tried to mix both but without success... A command line that would do both at the same time would be appreciated!

Thank you

score 5 · Answer 1 · answered Jan 15 '17 at 07:48


You can use pandoc to convert docx files. It doesn't support .doc files so you will need both pandoc and antiword.

Reusing your while loop:

find . -name '*.docx' | while IFS= read -r i; do pandoc --from docx --to plain "${i}" >"${i%.docx}.txt"; done
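Since the question asks for one command that handles both formats, the two loops can be folded into a single pass. This is only a sketch: `convert_tree` is a made-up helper name, and it assumes both pandoc (1.13 or later) and antiword are installed. If `a.doc` and `a.docx` sit in the same directory, both write to `a.txt` and the later one wins.

```bash
# convert_tree is a hypothetical helper name, not part of pandoc or antiword.
convert_tree() {
    # $1: starting directory (defaults to the current one)
    find "${1:-.}" -type f \( -name '*.docx' -o -name '*.doc' \) |
    while IFS= read -r file; do
        case "$file" in
            # ${file%.docx} strips only the trailing extension
            *.docx) pandoc --from docx --to plain "$file" > "${file%.docx}.txt" ;;
            *.doc)  antiword -i 1 "$file" > "${file%.doc}.txt" ;;
        esac
    done
}
```

Call it as `convert_tree .` (or with any other starting directory). File names containing a newline are not handled by this loop.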

answered Jan 15 '17 at 07:48

David Duponchel


Thank you. I tried using pandoc, but for some reason it creates empty .txt files with the following warning: pandoc: Unknown reader: docx. Any idea? But the loop is good: it is recursive and creates the file where it was. – jejuba Jan 15 '17 at 08:38
docx support was added in version [1.13](https://github.com/jgm/pandoc/releases/tag/1.13). Which version do you use? You may need to [install a recent version](http://pandoc.org/installing.html). – David Duponchel Jan 15 '17 at 10:13
Right, I have version 1.12... It is the one available with the stable Debian version. I have to see if I can install it from testing. Best, – jejuba Jan 15 '17 at 19:19
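
For anyone hitting the same "Unknown reader" error, a quick check of the installed pandoc version against 1.13 could look like the sketch below. `version_at_least` is a hypothetical helper, it relies on `sort -V` from GNU coreutils, and it assumes the first line of `pandoc --version` has the form `pandoc 1.12.4.2`.

```bash
# version_at_least is a hypothetical helper: succeeds if version $1 >= $2.
# Relies on `sort -V` (version sort), which GNU coreutils provides.
version_at_least() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v pandoc >/dev/null 2>&1; then
    # Assumption: the first line of `pandoc --version` looks like "pandoc 1.12.4.2"
    installed=$(pandoc --version | awk 'NR == 1 { print $2 }')
    if version_at_least "$installed" 1.13; then
        echo "pandoc $installed can read docx"
    else
        echo "pandoc $installed is older than 1.13: no docx reader"
    fi
else
    echo "pandoc is not installed"
fi
```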

hansaplast · Accepted Answer · 2017-01-23T07:29:04.657

2

The following script:

- converts all docx files under the directory where you run it, recursively (change the . in find . to start somewhere else)
- writes each txt file next to the docx file it was made from

Bash script:

find . -name "*.docx" | while IFS= read -r file; do
    # Quote "$file" so paths containing spaces survive
    unzip -p "$file" word/document.xml |
        sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' > "${file%.docx}.txt"
done

Afterwards you can run grep like this:

grep -r "some text" --include "*.txt" .
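
One thing to be aware of: the sed expression in the script leaves out the `s/<\/w:p>/\n/g` step from the question's one-liner, so the text of a document tends to end up on one long line and grep prints the whole file for every match. A variant that keeps one line per paragraph might look like this. It is a sketch only: `docx_xml_to_text` and `convert_docx_tree` are made-up names, the `\n` handling is a GNU sed feature, and XML entities such as `&amp;` are left undecoded.

```bash
# docx_xml_to_text is a hypothetical helper: reads word/document.xml on stdin
# and writes plain text with one line per paragraph (GNU sed assumed).
docx_xml_to_text() {
    sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'
}

# convert_docx_tree is a hypothetical wrapper around the same loop as the script.
convert_docx_tree() {
    find "${1:-.}" -name '*.docx' | while IFS= read -r file; do
        unzip -p "$file" word/document.xml | docx_xml_to_text > "${file%.docx}.txt"
    done
}
```

Run it with `convert_docx_tree .` and then use the same grep command.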

edited Jan 23 '17 at 07:29

answered Jan 15 '17 at 06:56

hansaplast


Thank you. It looks like it is not working recursively, though. And instead of creating a new directory, I'd like it to write the file to the directory where it found the .docx file. Any adjustment? – jejuba Jan 15 '17 at 07:20
@jejuba I changed the script so it starts at the current directory. It *does* work recursively (the old version did too). I also changed it so it stores the txt where it found the docx. The grep is a bit more complex now, as it needs to run recursively too – hansaplast Jan 15 '17 at 13:04
OK thanks it works well. The problem is that I have .docx files that are not truly .docx files. I have to sort this out. Many thanks. – jejuba Jan 15 '17 at 19:18
@jejuba if you want, upload the docx file and link it here, and I can help you with converting it to txt – hansaplast Jan 15 '17 at 19:24
OK, I understand what went wrong. I kept getting the following message: ambiguous redirect. The script failed on any directory with a space in its name. I added quotes around ... > "${file/docx/txt}" and now it's working fine! Thanks – jejuba Jan 22 '17 at 21:37
@jejuba thanks for the hint, I corrected the script above – hansaplast Jan 23 '17 at 07:29
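
For reference, the "ambiguous redirect" message comes from bash splitting an unquoted expansion at the space, so the redirect target turns into two words. A small demonstration, using a made-up path and a hypothetical `count_args` helper:

```bash
# count_args is a hypothetical helper that reports how many words it received.
count_args() { echo "$#"; }

# Made-up example path with a space in the directory name
file="./my reports/tutu.docx"

count_args ${file%.docx}.txt     # unquoted: split at the space, prints 2
count_args "${file%.docx}.txt"   # quoted: stays one word, prints 1
```

The same splitting applies to `unzip -p $file`, which is why the file argument needs quotes as well.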
