Extracting text from MS Word Document uploaded through FileUpload from ipyWidgets in Jupyter Notebook

Question

I am trying to let a user upload an MS Word file and then run a function that takes a string as its input argument. I am uploading the Word file through FileUpload, but I get an encoded object back. I am unable to decode it as UTF-8, and upload.value or upload.data just returns encoded text.

Any ideas how I can extract the content from the uploaded Word file?

upload = widgets.FileUpload()
upload
# I select the file I want to upload
upload.value  # Returns coded text
upload.data   # Returns coded text

# Previously upload['content'] worked, but I read this no longer works in ipywidgets 8.0

Roland Smith · Accepted Answer · 2020-04-06T20:08:57.840

1

Modern MS Word files (.docx) are actually zip files.

The text (but not the page headers) is inside an XML document called word/document.xml in the zip file.
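To make that structure concrete, here is a minimal standard-library sketch that opens the zip and pulls the paragraph text out of word/document.xml. The function name `docx_paragraph_texts` is my own, not part of any library, and the sketch assumes a plain document: it ignores headers, footnotes and text boxes, and it also picks up paragraphs nested inside tables.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraph_texts(source):
    """Return the text of each <w:p> paragraph found in a .docx file.

    `source` may be a file path or a binary file-like object.
    (Hypothetical helper for illustration, not a library function.)
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    texts = []
    for paragraph in root.iter(W_NS + "p"):
        # The visible text of a paragraph is split across <w:t> elements.
        pieces = (node.text or "" for node in paragraph.iter(W_NS + "t"))
        texts.append("".join(pieces))
    return texts
```

Because `zipfile.ZipFile` accepts file-like objects, the same function should also work on uploaded bytes wrapped in `io.BytesIO`.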

The python-docx module can be used to extract text from these documents. It is mainly used for creating documents, but it can read existing ones. Example from here.

>>> import docx
>>> doc = docx.Document('grokonez.docx')

>>> fullText = []
>>> for paragraph in doc.paragraphs:
...     fullText.append(paragraph.text)
...

Note that this will only extract the text from paragraphs, not, for example, the text from tables.
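If the table text matters too, something along these lines might do. It is only a sketch: `collect_text` is a name I made up, and it relies on nothing but the `paragraphs`, `tables`, `rows`, `cells` and `text` attributes that python-docx exposes. Body paragraphs come first and tables afterwards, so the original document order is not preserved, and merged cells may show up more than once.

```python
def collect_text(doc):
    """Gather paragraph text and table-cell text from a Document-like object.

    Hypothetical helper: expects an object shaped like a python-docx
    Document (attributes: paragraphs, tables -> rows -> cells -> text).
    """
    lines = [paragraph.text for paragraph in doc.paragraphs]
    for table in doc.tables:
        for row in table.rows:
            # Join the cells of one row with tabs so the row stays on one line.
            lines.append("\t".join(cell.text for cell in row.cells))
    return "\n".join(lines)
```

With python-docx installed, usage would look like `collect_text(docx.Document('grokonez.docx'))`, which returns a single string.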

Edit:

"I want to be able to upload the MS file through the FileUpload widget."

There are a couple of ways you can do that.

First, isolate the actual file data. upload.data is actually a list of bytes objects, one per uploaded file (upload.value is the dictionary), see here. So do something like:

rawdata = upload.data[0]

(Note that this format has changed across different versions of ipywidgets. The above example is from the documentation of the latest version. Read the relevant version of the documentation, or investigate the data in IPython, and adjust accordingly.)
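Since the layout differs between versions, a small helper that inspects upload.value can hide the difference. This is a hedged sketch: `first_upload_bytes` is a hypothetical name, and it assumes the two layouts I am aware of, a dict keyed by file name (ipywidgets 7.x) and a tuple of per-file records (ipywidgets 8.x), each record carrying a `content` entry. Check what your own version actually returns before relying on it.

```python
def first_upload_bytes(value):
    """Return the content of the first uploaded file as bytes.

    Hypothetical helper. Assumed layouts of FileUpload.value:
    - dict keyed by file name, each entry holding 'content' (7.x)
    - tuple or list of records, each holding 'content' (8.x)
    """
    if not value:
        raise ValueError("no file has been uploaded yet")
    if isinstance(value, dict):
        record = next(iter(value.values()))
    else:
        record = value[0]
    # 'content' may be bytes or a memoryview; bytes() accepts both.
    return bytes(record["content"])
```

In a notebook this would be called as `rawdata = first_upload_bytes(upload.value)` once a file has been selected.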

Write rawdata to e.g. foo.docx and open that. That would certainly work, but it does seem somewhat inelegant.
docx.Document can work with file-like objects. So you could create an io.BytesIO object, and use that.

Like this:

import io
import docx

foo = io.BytesIO(rawdata)
doc = docx.Document(foo)
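Before handing the bytes to docx.Document, it can be useful to check that they really are a .docx, since a .docx is a zip file containing word/document.xml. The sketch below uses only the standard library; `looks_like_docx` is a name I chose for illustration.

```python
import io
import zipfile


def looks_like_docx(rawdata):
    """Return True if rawdata is a zip archive containing word/document.xml.

    Hypothetical helper for a quick sanity check on uploaded bytes.
    """
    buffer = io.BytesIO(rawdata)
    if not zipfile.is_zipfile(buffer):
        return False
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        return "word/document.xml" in archive.namelist()
```

If this returns False for your upload, the object you passed is probably the surrounding list or dictionary rather than the file content itself.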

edited Apr 06 '20 at 20:08

answered Apr 05 '20 at 19:50


@Roland Hello, many thanks for your response. I have used the import docx method you suggested and it works fine. However, this works if you are importing your own MS Word file. I want to be able to upload the MS file through the FileUpload widget. Any ideas on that? – Marc Henry Saad Apr 05 '20 at 21:03
I think your methods are correct, but not when applied to upload.data. I've tried both methods and neither seems to work, mainly because upload.data is a list holding the zipped content and upload.value is a dictionary. I've tried several methods, but all the errors come down to the list/dict issue. It is not actually a bytes file. Sample output of upload.data: [b'PK\x03\x04\x14\x00\x06\x00\x08\x00\x00\x00!\x00\xdf\xa4\xd2lZ\x01\x00\x00 \x05\x00\x00\x13\x00\x08\x02[Content_Types].xml \xa2\x04\x02(\xa0\x00\x02\x00\x00 – Marc Henry Saad Apr 06 '20 at 08:13
@MarcHenrySaad See updated answer. Apparently the format of `data` has been subject to change. – Roland Smith Apr 06 '20 at 19:07
Finally worked! Many thanks for your help. I added the final code as a new answer. FYI, ['content'] throws an error because upload.data is the content itself, so there is no need to call content anymore. – Marc Henry Saad Apr 06 '20 at 19:48

score 0 · Answer 2 · answered Apr 06 '20 at 19:47

After some tweaking based on @Roland Smith's great suggestions, the following code finally worked:

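import io
import ipywidgets as widgets
from docx import Document

# Run this in one cell, then select the file in the widget.
upload = widgets.FileUpload()
upload

# Run this in a later cell, once the upload has finished.
# upload.data is a list of bytes objects here; this depends on the ipywidgets version.
rawdata = upload.data[0]
test = io.BytesIO(rawdata)
doc = Document(test)

for p in doc.paragraphs:
    print(p.text)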
