-1

Basically I have a folder with plenty of .doc/.docx files. I need them in .txt format. The script should iterate over all the files in a directory, convert them to .txt files and store them in another folder.

How can I do it?

Is there a module that can do this?

The Doctor
  • Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Please read [What topics can I ask about here?](https://stackoverflow.com/help/on-topic) – Mike Scotty Jun 26 '17 at 13:01
  • Edited the question. – The Doctor Jun 26 '17 at 13:04
  • Duplicate of https://stackoverflow.com/questions/685533/python-convert-microsoft-office-docs-to-plain-text-on-linux – Spacedman Jun 26 '17 at 13:07
  • While I'm not sure about the ins and outs of a Python library designed to do this, an important thing to remember about "new" MS Office documents is that they aren't single plain files but .zip archives. If you replace the .docx extension with .zip and unzip it, you'll see a directory containing folders. Go to the `word/` directory and open `document.xml`; it holds the text of your document wrapped in XML markup. I recommend looking through [PyPI](https://pypi.python.org/pypi) for a package that can parse XML and extract the text so it can be written into a .txt file. – nerdenator Jun 26 '17 at 13:12
  • There are many programs out there that convert docx files (and if you are on Windows, you can use the Word APIs to do it). The Python part is likely just a little bit that reads the directory and uses `subprocess` to run the tool. – tdelaney Jun 26 '17 at 13:34

3 Answers

3

I figured this would make an interesting quick programming project. This has only been tested on a simple .docx file containing "Hello, world!", but the train of logic should give you a place to work from to parse more complex documents.

from shutil import copyfile, rmtree
import sys
import os
import zipfile
from lxml import etree

# command format: python3 docx_to_txt.py Hello.docx

# let's get the file name
zip_dir = sys.argv[1]
# cut off the .docx, make it a .zip
zip_dir_zip_ext = os.path.splitext(zip_dir)[0] + '.zip'
# make a copy of the .docx and put it in .zip
copyfile(zip_dir, zip_dir_zip_ext)
# unzip the .zip
zip_ref = zipfile.ZipFile(zip_dir_zip_ext, 'r')
zip_ref.extractall('./temp')
# get the xml out of /word/document.xml
data = etree.parse('./temp/word/document.xml')
# we'll want to go over all 't' elements in the xml node tree.
# note that MS office uses namespaces and that the w must be defined in the namespaces dictionary args
# each :t element is the "text" of the file. that's what we're looking for
# result is a list filled with the text of each t node in the xml document model
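# skip empty t nodes, whose .text is None and would make .strip() fail
result = [node.text.strip() for node in data.xpath("//w:t", namespaces={'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}) if node.text]
# dump result into a new .txt file (utf-8 so non-ASCII text survives on any platform)
with open(os.path.splitext(zip_dir)[0]+'.txt', 'w', encoding='utf-8') as txt: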
    # join the elements of result together since txt.write can't take lists
    joined_result = '\n'.join(result)
    # write it into the new file
    txt.write(joined_result)
# close the zip_ref file
zip_ref.close()
# get rid of our mess of working directories
rmtree('./temp')
os.remove(zip_dir_zip_ext)

I'm sure there's a more elegant or Pythonic way to accomplish this. You'll need to have the file you want to convert in the same directory as the Python file. The command format is `python3 docx_to_txt.py file_name.docx`.
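As one possible tidier variant of the script above, here is a minimal sketch that reads `word/document.xml` straight from the archive, without the `.zip` copy or the `./temp` directory, and joins the text runs paragraph by paragraph. It uses only the standard library. The function name `docx_to_text` is hypothetical, and the sketch makes the same assumption as the script above: all the text of interest sits in `w:t` elements. Tabs, line breaks, headers and footnotes are ignored, and paragraphs nested inside other paragraphs (text boxes, for instance) may show up twice.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, same URI as in the script above
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path):
    """Return the text of a .docx file, one line per paragraph (hypothetical helper)."""
    # a .docx is a zip archive, so it can be opened directly; no renaming needed
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    # each w:p is a paragraph; its text is split over one or more w:t runs
    for paragraph in root.iter("{%s}p" % W_NS):
        runs = [node.text or "" for node in paragraph.iter("{%s}t" % W_NS)]
        lines.append("".join(runs))
    return "\n".join(lines)
```

Joining the runs inside each paragraph first keeps a sentence on one line even when Word has split it into several `w:t` elements.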

nerdenator
0

conda install -c conda-forge python-docx

from docx import Document

# `file` is the path to the .docx file to read
doc = Document(file)

for p in doc.paragraphs:
    print(p.text)

Alex Sam
0

Thought I would share my approach. It basically boils down to two commands that convert either .doc or .docx to a string; each option requires its own package:

import docx
import os
import glob
import subprocess
import sys

# .docx (pip3 install python-docx)
doctext = "\n".join(p.text for p in docx.Document(infile).paragraphs)
# .doc (apt-get install antiword)
doctext = subprocess.check_output(["antiword", infile]).decode("utf-8")

I then wrap these solutions up in a function that can either return the result as a Python string or write it to a file (with the option of appending or replacing).

import docx
import os
import glob
import subprocess
import sys

def doc2txt(infile, outfile, return_string=False, append=False):
    if os.path.exists(infile):
        if infile.endswith(".docx"):
            try:
                doctext = "\n".join(p.text for p in docx.Document(infile).paragraphs)
            except Exception as e:
                print("Exception in converting .docx to str: ", e)
                return None
        elif infile.endswith(".doc"):
            try:
                doctext = subprocess.check_output(["antiword", infile]).decode("utf-8")
            except Exception as e:
                print("Exception in converting .doc to str: ", e)
                return None
        else:
            print("{0} is not .doc or .docx".format(infile))
            return None

        if return_string:
            return doctext
        else:
            writemode = "a" if append else "w"
            # the with block closes the file; utf-8 matches the decoded text
            with open(outfile, writemode, encoding="utf-8") as f:
                f.write(doctext)
    else:
        print("{0} does not exist".format(infile))
        return None

I then would call this function via something like:

files = glob.glob("/path/to/filedir/**/*.doc*", recursive=True)
outfile = "/path/to/out.txt"
for file in files:
    doc2txt(file, outfile, return_string=False, append=True)
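
The loop above appends everything to a single `out.txt`, whereas the question asks for one `.txt` per document in a separate folder. A minimal sketch of that variant follows. `convert_folder` and its `to_text` parameter are hypothetical names; `to_text` is assumed to be any callable that takes a path and returns a string or `None`, for example `lambda p: doc2txt(p, None, return_string=True)`.

```python
from pathlib import Path


def convert_folder(src_dir, dst_dir, to_text):
    """Write one .txt per .doc/.docx found in src_dir into dst_dir.

    to_text is a caller-supplied function: path string in, text (or None) out.
    Returns the list of .txt paths that were written.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for doc in sorted(src.iterdir()):
        # only regular files with a .doc or .docx extension (any letter case)
        if not doc.is_file() or doc.suffix.lower() not in (".doc", ".docx"):
            continue
        text = to_text(str(doc))
        if text is None:  # conversion failed; skip this file
            continue
        # keep the original extension in the name so a.doc and a.docx don't collide
        out = dst / (doc.name + ".txt")
        out.write_text(text, encoding="utf-8")
        written.append(out)
    return written
```

Passing the converter in as an argument keeps the folder walk independent of whether python-docx, antiword or some other tool does the actual extraction.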

It's not often that I need to perform this operation, but so far the script has worked for all my needs. If you find a bug in this function, let me know in a comment.

alexexchanges