What I have as input: the raw bytes of a docx document, encoded as base64.
What I am trying to achieve: extract text from this document for further processing.
I tried to follow this answer: extracting text from MS word files in python
My code fragment:
base64_bytes = input.encode('utf-8')
decoded_data = base64.decodebytes(base64_bytes)
document = Document(decoded_data)
docText = '\n\n'.join([paragraph.text.encode('utf-8') for paragraph in document.paragraphs])
The line document = Document(decoded_data) gives me the following error: AttributeError: 'bytes' object has no attribute 'seek'
The decoded_data value is in the following format: b'PK\\x03\\x04\\x14\\x00\\x08\\x08\\x08\\x00\\x87@CP\\x00...
How should I format the raw data to extract text from docx?
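A minimal sketch of one likely fix, assuming the library in use is python-docx: its Document() takes either a file path or a file-like object, and the 'seek' error suggests it is treating the raw bytes as a file. Wrapping the decoded bytes in io.BytesIO gives it a seekable stream. The function names base64_to_stream and extract_docx_text are hypothetical, chosen only for illustration.

```python
import base64
import io


def base64_to_stream(b64_input):
    """Hypothetical helper: decode base64 text and wrap the resulting
    bytes in a seekable, file-like object."""
    raw = base64.b64decode(b64_input)
    return io.BytesIO(raw)


def extract_docx_text(b64_input):
    """Hypothetical helper: return the paragraph text of a base64-encoded
    docx, assuming python-docx is installed."""
    # Imported here so base64_to_stream works without python-docx.
    from docx import Document

    # Pass the stream, not the bare bytes object.
    document = Document(base64_to_stream(b64_input))

    # paragraph.text is already a str in Python 3, so it is joined
    # directly; joining encoded bytes with a str separator would raise
    # a TypeError.
    return "\n\n".join(paragraph.text for paragraph in document.paragraphs)
```

Usage would then be docText = extract_docx_text(input), where input is the base64 string from the question.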