Python - grab all text in .docx and dump into .txt

Question

I am wondering how I would write a Python script to carry out the following set of steps: (1) open a typical .docx, (2) select all, (3) copy to clipboard, (4) store as a string.

I don't care about preserving any formatting, nor about graphics, nor about tables. I just want the text stored as a gigantic string, for parsing and analysis.

There is a package called docx2txt that can read text from Word documents. Have you tried that? — Shenan, Oct 29 '19 at 19:13

score 1 · Answer 1 · answered Oct 29 '19 at 19:18

Since you are working with a .docx file, you could consider using python-docx: https://python-docx.readthedocs.io/en/latest/

According to the documentation, you could write something like this:

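import docx  # provided by the python-docx package

def getText(filename):
    # Open the document and collect the text of each body paragraph
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    # Join the paragraphs into one newline-separated string
    return '\n'.join(fullText)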

That returns all the text as one string. To copy it to the clipboard, you could use something like pyperclip. I have not tried it, but I would imagine something like:

import docx
import pyperclip

textInFile = getText("yourDoc.docx")
pyperclip.copy(textInFile)

https://github.com/asweigart/pyperclip
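
The title also asks about dumping the text into a .txt file, which the clipboard step does not cover. A minimal sketch of that last step follows, assuming the text is already held in a string. The helper name dump_text is hypothetical and not part of python-docx or pyperclip.

```python
from pathlib import Path


def dump_text(text, out_path):
    """Write text to out_path as UTF-8 and return the number of characters written."""
    # Word documents often contain curly quotes and other non-ASCII
    # characters, so the encoding is set explicitly instead of relying
    # on the platform default.
    return Path(out_path).write_text(text, encoding="utf-8")
```

With the getText function above, usage would look like dump_text(getText("yourDoc.docx"), "yourDoc.txt").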

score 0 · Answer 2 · answered Oct 29 '19 at 19:16

There are libraries to help with this. Take a look at python-docx, which despite being oriented towards creating and updating docx files will allow you to read the contents of a document.

This answer HERE might help you start, but is by no means complete.

Here's a link to the python-docx documentation.
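
If installing a library is not an option, a .docx file is a ZIP archive whose main text lives in word/document.xml, so the standard library alone can pull the text out. The following is only a rough sketch under that assumption, not how python-docx works internally. The function name docx_to_text is hypothetical. It ignores tabs, manual line breaks, headers, and footers, and text inside text boxes may show up more than once.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the paragraph text of a .docx as one newline-joined string."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    # Every <w:p> element is a paragraph (including those inside table
    # cells); its visible text is stored in <w:t> elements.
    for para in root.iter(W_NS + "p"):
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

Calling docx_to_text("yourDoc.docx") would then give one large string, ready for parsing.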
