
What I have as input: the raw bytes of a docx document, encoded as base64.
What I am trying to achieve: extract text from this document for further processing.
I tried to follow this answer: extracting text from MS word files in python

My code fragment:

base64_bytes = input.encode('utf-8')
decoded_data = base64.decodebytes(base64_bytes)
document = Document(decoded_data)
docText = '\n\n'.join([paragraph.text.encode('utf-8') for paragraph in document.paragraphs])

The document = Document(decoded_data) line gives me the following error: AttributeError: 'bytes' object has no attribute 'seek'
The decoded_data is in the following format: b'PK\\x03\\x04\\x14\\x00\\x08\\x08\\x08\\x00\\x87@CP\\x00...

How should I format the raw data to extract text from docx?

Michał Herman
  • `input.encode('utf-8')`. Is this your actual code? Because this is trying to encode the function object `input` as UTF-8 – clubby789 Feb 06 '20 at 11:10
  • 1) Your title says "`seek`", your question says "`code`". Which is it? 2) What exactly is `Document` and what kind of argument does it expect? – deceze Feb 06 '20 at 11:11
  • You say you are following the advice [Use the native Python docx module...](https://stackoverflow.com/a/1979906/2564301) and then -- you do *not* follow it. You do **not** need to encode, decode, or even explicitly load the file 'manually'. – Jongware Feb 06 '20 at 11:13
  • @usr2564301 they only diverge where they have to, their input is in-memory base64 content rather than a file on disk. – Masklinn Feb 06 '20 at 11:17

1 Answer


From the official documentation, emphasis mine:

docx.Document(docx=None)

Return a Document object loaded from docx, where docx can be either a path to a .docx file (a string) or a file-like object. If docx is missing or None, the built-in default document “template” is loaded.

So if you provide a string, it is interpreted as the path to a docx file. To provide the contents from memory, you need to pass in a file-like object, such as an io.BytesIO instance (the entire point of StringIO and BytesIO is to wrap strings and bytes in file-like objects):

document = Document(io.BytesIO(decoded_data))

Side note: you probably want to remove the .encode call in the list comprehension. In Python 3, text (str) and bytes (bytes) are not compatible at all, so the line will raise a TypeError when '\n\n'.join() is handed bytes items (the encoded text) to join with a str separator.
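Putting both points together, a minimal sketch of the whole flow might look like the following. The helper names docx_stream_from_base64 and extract_text are made up for illustration, and the sketch assumes the python-docx package is installed and that the input is plain base64 text:

```python
import base64
import io


def docx_stream_from_base64(b64_text: str) -> io.BytesIO:
    """Decode base64 text and wrap the raw bytes in a seekable file-like object."""
    raw = base64.b64decode(b64_text)
    return io.BytesIO(raw)


def extract_text(b64_text: str) -> str:
    """Return the paragraphs of a base64-encoded .docx, joined by blank lines."""
    # Imported lazily so the decoding helper works without python-docx installed.
    from docx import Document  # third-party: pip install python-docx

    document = Document(docx_stream_from_base64(b64_text))
    # paragraph.text is already a str, so no .encode() is needed before joining.
    return "\n\n".join(paragraph.text for paragraph in document.paragraphs)
```

Calling extract_text(input_string) on the base64 string from the question should then return the document text as a single str.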

Masklinn
  • Doing this, I get an `Exception: BadZipFile: File is not a zip file`. I've tried creating a ZipFile class instance with the BytesIO(decoded_data) object but I get a different error doing so. Any thoughts? – Sean Richards Mar 31 '22 at 17:05
  • That your file is not an actual docx? Possibly a "legacy" doc file? A docx is a zipfile, so internally `docx.Document` will open the zipfile and start parsing its content. – Masklinn Mar 31 '22 at 19:22
  • That's the tough part: I turned a known .docx file into base64 to test passing it via a JSON request body. Ingesting the base64 string and decoding it into a BytesIO instance results in that error. Womp womp. I think I'll probably post a question. Thanks for the response! – Sean Richards Mar 31 '22 at 21:15
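
For anyone hitting the same BadZipFile error, one way to narrow it down is to check the decoded payload with the standard library before handing it to docx.Document. This is only a sketch: the function name looks_like_docx is hypothetical, and the causes it rules out (non-base64 characters such as a data-URL prefix, or a payload that is not a zip archive) are guesses, not something confirmed in this thread. It also assumes the main document part sits at word/document.xml, which is the usual location but not guaranteed:

```python
import base64
import io
import zipfile


def looks_like_docx(b64_text: str) -> bool:
    """Return True if the decoded payload is a zip containing word/document.xml."""
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        raw = base64.b64decode(b64_text, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
    stream = io.BytesIO(raw)
    if not zipfile.is_zipfile(stream):
        return False
    with zipfile.ZipFile(stream) as archive:
        # Assumption: the main document part lives at its usual path.
        return "word/document.xml" in archive.namelist()
```

If this returns False for a file that is known to be a valid .docx, the base64 string was most likely altered somewhere between encoding and decoding.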