reading docx with python2.7

Question

I'm trying to read a docx file with the following code:

from docx import Document
doc = Document('test.docx')

But when I try to print it, I get this:

<docx.api.Document object at 0x02952C70>

How can I read the content inside the file?

I read that docx changed recently, so the old questions and answers don't apply anymore.

Yes, I already checked the docs, but I didn't see a function for paragraphs. I only saw the sections function, which also returns a similar hexadecimal code. – user3511563, Jul 23 '14 at 03:48

SebasSBM · Answer 1 · 2014-08-10T13:03:36.093

4

Check out the structure of the Document object here:

Source code for docx.api

For example, if you want to get the property "paragraphs":

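from docx import Document
doc = Document('test.docx')
# paragraphs is a property (a list of Paragraph objects), so it is not called
paragraphs = doc.paragraphs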

I hope this will help.

EDIT: I found this snippet in the python-docx GitHub repository and edited it a little:

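import docx

filename = 'test.docx'
document = docx.Document(filename)
# Encode each paragraph's unicode text as UTF-8 bytes, then join with blank lines
docText = '\n\n'.join([
    paragraph.text.encode('utf-8') for paragraph in document.paragraphs
])
print docText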

join() receives a list of UTF-8 encoded strings, one for each paragraph in the list returned by the paragraphs property, so the result looks like:

paragraph 1

paragraph 2

paragraph 3

It looks like this works, but it doesn't print tables, headers or footers.

EDIT: This link is the main index for all documentation about python-docx:

python-docx 0.7.4 documentation

edited Aug 10 '14 at 13:03

answered Jul 23 '14 at 03:51


Ok, but I'm still getting hexadecimal codes, and not pure text – user3511563 Jul 23 '14 at 10:12
I've edited the answer, adding new information and a code snippet. Check it out. – SebasSBM Jul 23 '14 at 15:47
Thanks, but I'm getting problems with special characters (e.g. instead of getting são, I'm getting s├úo). – user3511563 Jul 25 '14 at 01:57
Maybe in your case you don't need to encode, so it would be just paragraph.text without the encode() method; otherwise you need to encode it using the encoding that your software uses. Please forgive the repetition. – SebasSBM Jul 28 '14 at 15:27
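
The garbled output in the comments above is consistent with UTF-8 bytes being displayed by a console that uses a different code page. The sketch below reproduces that effect; cp850 is only an assumption about the console (the thread does not say which one was in use), and the variable names are illustrative.

```python
# The text as a unicode string: "sao" with a tilde on the a (U+00E3).
text = u's\xe3o'

# Encoding to UTF-8 turns the single a-tilde character into two bytes.
utf8_bytes = text.encode('utf-8')

# Assumption: the console interprets output bytes as cp850.
# Each of the two UTF-8 bytes is then shown as its own character.
garbled = utf8_bytes.decode('cp850')
```

Here garbled is u's\u251c\xfao', which displays as s├úo. If this is the cause, printing paragraph.text directly (without encode()), or encoding with the console's own encoding, should avoid the problem.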

score 0 · Accepted Answer · edited May 23 '17 at 12:02

It is possible to extract information from Word files in Python without using the docx module. One solution (there are many), from etienne, is a very basic version of docx, which may remove the hexadecimal numbers that you are getting. However, like SebasSBM's answer, it won't work for other features such as tables.

If that still doesn't work, I would suggest looking at these answers; maybe one of them will still be relevant to your new docx format.
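
As a rough illustration of reading a Word file without the docx module, here is a minimal sketch that uses only the standard library. It relies on the fact that a .docx file is a zip archive whose main text is stored in word/document.xml. The helper name docx_paragraphs is hypothetical, and this is not the code from the solution linked above.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def docx_paragraphs(path_or_file):
    """Return the text of every w:p paragraph, in document order.

    Hypothetical helper: accepts a file name or a file-like object.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    paragraphs = []
    for p in root.iter(W_NS + 'p'):
        # A paragraph's text is split across one or more w:t elements.
        pieces = [t.text or u'' for t in p.iter(W_NS + 't')]
        paragraphs.append(u''.join(pieces))
    return paragraphs
```

Calling docx_paragraphs('test.docx') returns a list of unicode strings. Because the loop visits every w:p element in the body, paragraphs inside table cells are included as well. Headers and footers live in separate parts of the archive and are not read by this sketch.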
