How to extract text from an existing docx file using python-docx

Question

I'm trying to use the python-docx module (pip install python-docx), but it is confusing: the test sample in the GitHub repo uses the opendocx function, while the documentation on Read the Docs uses the Document class. Even then, the docs only show how to add text to a docx file, not how to read an existing one.

The first one (opendocx) is not working and may be deprecated. For the second case I was trying to use:

from docx import Document

document = Document('test_doc.docx')
print(document.paragraphs)

It returned a list of <docx.text.Paragraph object at 0x... >

Then I did:

for p in document.paragraphs:
    print(p.text)

It returned all the text, but a few things were missing. All URLs (CTRL+CLICK to go to URL) were absent from the text on the console.

What is the issue? Why are the URLs missing?

How could I get the complete text without iterating in a loop (something like open().read())?

Note the old GitHub repo https://github.com/mikemaccana/python-docx has 'This Project Has Moved!' in heading 1. — mikemaccana, Aug 05 '15 at 10:43

score 71 · Answer 1 · answered Mar 08 '16 at 15:28

You can try this:

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

answered Mar 08 '16 at 15:28

Chinmoy Panda

This is a good start, but it does not reflect text in tables, headers, footers, or footnotes. – guerda Feb 15 '18 at 13:46
Consider using [simplify-docx](https://github.com/microsoft/Simplify-Docx), which is based on python-docx and substantially reduces the complexity of the XML document while keeping much of the structure (paragraphs, tables, headers, footers, etc.) – Jthorpe Nov 21 '19 at 23:46
How is this different from the method the questioner already has? In fact it is even worse, because it creates a useless list instead of a text! And I see 59 upvotes for this! They should actually be downvotes. (I didn't downvote because I never do. I prefer instead to explain why replies like this one are really bad.) – Apostolos Aug 25 '21 at 10:29
indeed it just confirms that the question is hard to solve – Jean-François Fabre Apr 29 '22 at 07:45
fun one-liner: `'\n'.join([p.text for p in doc.paragraphs])` – Matt Oct 07 '22 at 14:36
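
Following up on the comments above, here is a minimal sketch of how the same loop could be extended to tables, headers, and footers. It assumes a python-docx version that exposes `doc.tables` and `section.header` / `section.footer`; the function name `collect_text` is made up for illustration. Footnotes and nested tables are not covered.

```python
def collect_text(doc):
    """Gather text from body paragraphs, tables, headers and footers.

    `doc` is expected to be a python-docx Document, e.g. docx.Document(path).
    """
    parts = [para.text for para in doc.paragraphs]

    # Table text lives in cells, not in doc.paragraphs.
    for table in doc.tables:
        for row in table.rows:
            parts.append("\t".join(cell.text for cell in row.cells))

    # Each section carries its own header and footer.
    for section in doc.sections:
        parts.extend(p.text for p in section.header.paragraphs)
        parts.extend(p.text for p in section.footer.paragraphs)

    return "\n".join(parts)
```

Usage would look like `print(collect_text(docx.Document('test_doc.docx')))`. Note that the result groups body paragraphs first, then tables, then headers and footers, so it is not in reading order.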

Ankush Shah · Answer 2 · 2016-02-17T23:08:07.267

26

You can use python-docx2txt which is adapted from python-docx but can also extract text from links, headers and footers. It can also extract images.
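
For reference, a minimal usage sketch, assuming the package is installed with `pip install docx2txt`. `process` is the package's entry point; the wrapper name `extract_with_docx2txt` and its `image_dir` parameter are only for illustration.

```python
def extract_with_docx2txt(path, image_dir=None):
    """Return the text of a .docx file using the third-party docx2txt package."""
    # Imported lazily so the function can be defined without the package present.
    import docx2txt  # third-party: pip install docx2txt

    # With image_dir set, embedded images are also written to that directory.
    return docx2txt.process(path, image_dir)
```

Calling `extract_with_docx2txt('test_doc.docx')` would return the text as a single string, which is close to the `open().read()` style the question asks for.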

edited Feb 17 '16 at 23:08

answered Oct 29 '15 at 02:59

Ankush Shah

This is a useful piece of code, but it does not export numbered lists. – robob May 07 '17 at 05:26
thanks, [here is the tracking issue for this bug](https://github.com/ankushshah89/python-docx2txt/issues/12) – Ankush Shah May 08 '17 at 02:42
Updated version is in this package https://github.com/ShayHill/docx2python – Roland Pihlakas Jan 05 '23 at 15:26

score 16 · Answer 3 · edited Aug 27 '20 at 20:04

Without Installing python-docx

A docx file is basically a zip file with several folders and files inside it. The link below has a simple function to extract the text from a docx file without relying on python-docx and lxml, the latter sometimes being hard to install:

http://etienned.github.io/posts/extract-text-from-word-docx-simply/
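
In the same spirit, here is a minimal standard-library sketch of the idea. It is not the code from the linked post; the function name `docx_to_text` is hypothetical, and it only reads the main body part (`word/document.xml`). Headers, footers, and footnotes live in other parts of the archive.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    # iter() replaces the getiterator() method removed in Python 3.9.
    for para in root.iter(W_NS + "p"):
        # Collect every w:t descendant, including those nested in w:hyperlink.
        lines.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(lines)
```

`zipfile.ZipFile` raises `zipfile.BadZipFile` when the input is not a zip archive, which is what you would see for a legacy .doc file that was merely renamed to .docx.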

edited Aug 27 '20 at 20:04

John Smith

answered Aug 16 '17 at 04:32

imanzabet

I get this with your code "zipfile.BadZipFile: File is not a zip file". Why would that be? – John Smith Aug 27 '20 at 06:47
This code worked for me previously. Can you upload your docx file and provide a link so that I can test it? – imanzabet Sep 12 '20 at 01:29
This still works, but .getiterator() has been removed and has to be replaced with .iter() now: https://docs.python.org/3.9/whatsnew/3.9.html#removed – Ping Lu Jan 26 '23 at 09:01

scanny · Answer 4 · 2020-01-25T21:00:25.100

There are two "generations" of python-docx. The initial generation ended with the 0.2.x versions and the "new" generation started at v0.3.0. The new generation is a ground-up, object-oriented rewrite of the legacy version. It has a distinct repository located here.

The opendocx() function is part of the legacy API. The documentation is for the new version. The legacy version has no documentation to speak of.

Neither reading nor writing hyperlinks is supported in the current version. That capability is on the roadmap, and the project is under active development. It turns out to be quite a broad API because Word has so much functionality. So we'll get to it, but probably not in the next month unless someone decides to focus on that aspect and contribute it. UPDATE: Hyperlink support was added subsequent to this answer.
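
Independent of what python-docx supports, the URL of an external hyperlink is not stored in the paragraph text. It sits in the document's relationships part, which the `w:hyperlink` element references by id. Below is a minimal standard-library sketch for listing those URLs, assuming a standard (transitional) .docx that contains `word/_rels/document.xml.rels`; the function name `hyperlink_targets` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the package relationships part.
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)


def hyperlink_targets(path):
    """Map relationship ids (e.g. 'rId5') to hyperlink URLs in a .docx file."""
    with zipfile.ZipFile(path) as archive:
        rels = ET.fromstring(archive.read("word/_rels/document.xml.rels"))

    # Keep only relationships whose type marks them as hyperlinks.
    return {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(REL_NS + "Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    }
```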

Has this been fixed in the latest version? It's hard to tell from GitHub. – acutesoftware, Jul 30 '15 at 03:57

score 7 · Answer 5 · answered Jun 06 '18 at 05:40

Using python-docx, as @Chinmoy Panda 's answer shows:

for para in doc.paragraphs:
    fullText.append(para.text)

However, para.text will lose the text inside w:smartTag elements (the corresponding GitHub issue is here: https://github.com/python-openxml/python-docx/issues/328), so you should use the following function instead:

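def para2text(p):
    # './/w:t' selects every text element below the paragraph, including nested ones.
    rs = p._element.xpath('.//w:t')
    # Join without a separator, since a single word can span several w:t elements.
    return u"".join([r.text or u"" for r in rs])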
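
To illustrate why nested text can go missing, here is a small standard-library sketch on a hand-written paragraph. It assumes, in line with the linked issue, that the affected `para.text` only looks at runs that are direct children of the paragraph; the XML and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# A hand-written paragraph: one direct run plus one run wrapped in w:smartTag.
PARAGRAPH_XML = (
    f'<w:p xmlns:w="{W}">'
    "<w:r><w:t>Visit </w:t></w:r>"
    "<w:smartTag><w:r><w:t>Paris</w:t></w:r></w:smartTag>"
    "</w:p>"
)

p = ET.fromstring(PARAGRAPH_XML)

# Only runs that are direct children of the paragraph.
direct_only = "".join(t.text or "" for t in p.findall("w:r/w:t", NS))

# Every w:t descendant, the same set './/w:t' selects.
all_text = "".join(t.text or "" for t in p.findall(".//w:t", NS))

print(repr(direct_only))  # 'Visit '
print(repr(all_text))     # 'Visit Paris'
```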

score 0 · Answer 6 · answered Feb 17 '21 at 08:59

It seems that there is no official solution for this problem, but there is a workaround posted here: https://github.com/savoirfairelinux/python-docx/commit/afd9fef6b2636c196761e5ed34eb05908e582649

Just update the file "...\site-packages\docx\oxml\__init__.py":

# add
import re
import sys

# add
def remove_hyperlink_tags(xml):
    if (sys.version_info > (3, 0)):
        xml = xml.decode('utf-8')
    xml = xml.replace('</w:hyperlink>', '')
    xml = re.sub('<w:hyperlink[^>]*>', '', xml)
    if (sys.version_info > (3, 0)):
        xml = xml.encode('utf-8')
    return xml
    
# update
def parse_xml(xml):
    """
    Return root lxml element obtained by parsing XML character string in
    *xml*, which can be either a Python 2.x string or unicode. The custom
    parser is used, so custom element classes are produced for elements in
    *xml* that have them.
    """
    root_element = etree.fromstring(remove_hyperlink_tags(xml), oxml_parser)
    return root_element

And of course, don't forget to mention in your documentation that you are changing the official library.
