0

I am wondering how I would write a Python script to carry out the following set of steps: (1) open a typical .docx, (2) select all, (3) copy to clipboard, (4) store as a string.

I don't care about preserving any formatting, nor about graphics, nor about tables. I just want the text stored as a gigantic string, for parsing and analysis.

ignoramus
  • 13
  • 3

2 Answers2

1

Since you are talking about a docx you could consider using python-docx https://python-docx.readthedocs.io/en/latest/

According to the documentation you could write something like this

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

To get all the text then using something like pyperclip you could copy it to clipboard. So without trying it i would imagine something like

import docx
import pyperclip

textInFile = getText("yourDoc.docx")
pyperclip.copy(textInFile)

https://github.com/asweigart/pyperclip

jstuartmilne
  • 4,398
  • 1
  • 20
  • 30
0

There are libraries to help with this. Take a look at python-docx, which despite being oriented towards creating and updating docx files will allow you to read the contents of a document.

This answer HERE might help you start, but is by no means complete.

Here's a link to the python-docx documentation.

JavierCastro
  • 318
  • 2
  • 8