Extracting text from MS Word files in Python

Question

For working with MS Word files in Python, there are the Python win32 extensions, which can be used on Windows. How do I do the same on Linux? Is there a library for this?

Can you define "working with"? Reading only, or writing too? — Mawg says reinstate Monica, Nov 09 '18 at 09:55

score 33 · Answer 1 · edited Feb 06 '20 at 11:37

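Use the pure-Python python-docx package (imported as docx). Here's how to extract all the text from a .docx document:

import docx  # provided by the python-docx package

filename = 'Hello world.docx'  # path to your .docx file
document = docx.Document(filename)
# Join the text of every paragraph, separated by blank lines
docText = '\n\n'.join(
    paragraph.text for paragraph in document.paragraphs
)
print(docText)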

See Python DocX site

Also check out Textract which pulls out tables etc.

Parsing XML with regexes invokes Cthulhu. Don't do it!

edited Feb 06 '20 at 11:37 by Masklinn
answered Dec 30 '09 at 12:17 by mikemaccana

are you doing `from docx import *` here? if not, how are you getting `getdocumenttext`, etc.? – abcd May 07 '15 at 22:50
`opendocx` is not in the module (perhaps it was in 2009). Documents are opened through the Document class, e.g. `import docx; document = docx.Document('Hello world.docx')`. – egpbos Aug 05 '15 at 05:40
@egpbos I've updated the example code to use the newer generation python-docx. – mikemaccana Aug 05 '15 at 10:45
This code resulted in an error for me: paragraph.text.encode('utf-8') for paragraph in document.paragraphs TypeError: sequence item 0: expected str instance, bytes found – MyopicVisage Nov 02 '15 at 04:10
@MyopicVisage check the official site - it's possible the latest version has a different signature. – mikemaccana Nov 02 '15 at 11:46
I have a scenario where I need to get the docx file from an HTTP URL. How can I do that part? – Joyson Feb 15 '18 at 06:50
I have created a question at https://stackoverflow.com/questions/48800385/how-to-create-ms-word-docx-file-in-python-with-raw-data?noredirect=1#comment84603535_48800385 – Joyson Feb 15 '18 at 07:01
From the Python DocX site: install with `pip install python-docx` (i.e. not just `docx`) – PangolinPaws Jul 27 '23 at 08:05

score 22 · Accepted Answer · answered Sep 24 '08 at 04:13

You could make a subprocess call to antiword. Antiword is a Linux command-line utility for dumping the text out of a Word .doc file. It works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as an RPM, or you could compile it yourself.
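
A minimal sketch of that subprocess call, assuming antiword is installed and on the PATH. The function name `doc_to_text` and its `tool` parameter are illustrative and not part of antiword or any library. Decoding the output as UTF-8 is also an assumption, since antiword's output encoding depends on how it is configured.

```python
import shutil
import subprocess


def doc_to_text(path, tool="antiword"):
    """Run a text-dumping command-line tool on `path` and return its output."""
    # Fail early with a clear message if the tool is not installed.
    if shutil.which(tool) is None:
        raise RuntimeError(f"{tool!r} was not found on PATH")
    # Passing a list (no shell) keeps paths with spaces or quotes safe.
    result = subprocess.run([tool, path], capture_output=True, check=True)
    # Assumption: treat the output as UTF-8 and replace undecodable bytes.
    return result.stdout.decode("utf-8", errors="replace")
```

With a hypothetical file name, usage would look like `text = doc_to_text("report.doc")`. A non-zero exit status from the tool raises `subprocess.CalledProcessError`.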

answered Sep 24 '08 at 04:13 by John Fouhy

antiword can convert word documents to DocBook XML, which will preserve (at least some) formatting. – Marius Gedminas Sep 30 '15 at 11:38

score 20 · Answer 3 · edited Jul 12 '18 at 04:52

benjamin's answer is a pretty good one. I have just consolidated...

import zipfile, re

docx = zipfile.ZipFile('/path/to/file/mydocument.docx')
content = docx.read('word/document.xml').decode('utf-8')
cleaned = re.sub('<(.|\n)*?>','',content)
print(cleaned)
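
The regex above leaves XML entities escaped and runs all paragraphs together. As an alternative sketch that still uses only the standard library, the snippet below parses word/document.xml with xml.etree.ElementTree and joins the w:t runs of each w:p paragraph. The helper name `docx_to_text` is hypothetical. It ignores headers, footers, footnotes, tabs and line breaks, and paragraphs nested inside text boxes may be emitted more than once.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Because the XML parser decodes entities itself, no separate unescape step is needed with this approach.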

edited Jul 12 '18 at 04:52 by Worm
answered Dec 28 '09 at 03:39 by Chad

I should reiterate this only works for docx (Word 2007 or later). For .doc files wvware is your best bet. Depending on your environment it can be a pain to setup, but it does do a very nice job. – Chad Dec 28 '09 at 03:41
To unescape XML entities in the text: `from xml.sax.saxutils import unescape; text = unescape(cleaned)` – Jesvin Jose Aug 01 '11 at 08:06
Use `content = docx.read('word/document.xml').decode('utf-8')`, otherwise you will get an error while cleaning: `TypeError: cannot use a string pattern on a bytes-like object` – me_astr Oct 10 '17 at 07:38

score 11 · Answer 4 · answered Sep 24 '08 at 03:23

OpenOffice.org can be scripted with Python: see here.

Since OOo can load most MS Word files flawlessly, I'd say that's your best bet.

answered Sep 24 '08 at 03:23 by Dan Lenski

Not flawlessly. Close, but far from flawless in my experience (OO 2.0 - 3.0). – SpliFF May 26 '09 at 15:17
As flawless as MS Word N+1 opens MS Words N files, and way better than MS Word N+1 opens MS Words N-1 files, IMHO – Esteban Küber Sep 29 '09 at 14:50

score 7 · Answer 5 · answered Jan 01 '09 at 01:14

I know this is an old question, but I was recently trying to find a way to extract text from MS word files, and the best solution by far I found was with wvLib:

http://wvware.sourceforge.net/

After installing the library, using it in Python is pretty easy:

import subprocess

# wvText writes the extracted text to output_txt_file
subprocess.run(['wvText', word_file, output_txt_file], check=True)
# Read the text back from the output file
with open(output_txt_file, errors='replace') as f:
    out = f.read()

And that's it. Pretty much, what we're doing is using subprocess.run to call wvText (which extracts the text from a Word document into a file) and then reading that output file. After that, the entire text from the Word document will be in the out variable, ready to use.

Hopefully this will help anyone having similar issues in the future.

score 4 · Answer 6 · answered May 16 '12 at 11:35

Unoconv might also be a good alternative: http://linux.die.net/man/1/unoconv

answered May 16 '12 at 11:35 by fccoelho

score 4 · Answer 7 · edited May 23 '17 at 12:34

Take a look at how the doc format works and create word document using PHP in linux. The former is especially useful. Abiword is my recommended tool. There are limitations though:

However, if the document has complicated tables, text boxes, embedded spreadsheets, and so forth, then it might not work as expected. Developing good MS Word filters is a very difficult process, so please bear with us as we work on getting Word documents to open correctly. If you have a Word document which fails to load, please open a Bug and include the document so we can improve the importer.

edited May 23 '17 at 12:34 by Community
answered Sep 24 '08 at 03:17 by Swati

Not just that though! Even the most basic text saved in the Word 97 format is nearly impossible to get at easily without relying on word to do it for you (COM). Most word documents are not HTML! – William Keller Sep 24 '08 at 03:30
Abiword doesn't assume that it's a HTML document, and considering how extensive the tool is...I don't think it was "easy" to implement it. Abiword is a tool that helps you to read MS Word files...and since the author is concerned with text retrieval, this suffices. – Swati Sep 24 '08 at 03:42
Ah, I'd always thought that abiword was just another word processor! Man, that would have saved me some headaches awhile back. – William Keller Sep 24 '08 at 12:11

score 4 · Answer 8 · edited May 23 '17 at 12:10

(Note: I posted this on this question as well, but it seems relevant here, so please excuse the repost.)

Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously to use this in a Qt program you'd have to spawn a process for it etc, but the command line I've hacked together is:

unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'

So that's:

unzip -p file.docx: -p == "unzip to stdout"

grep '<w:t': Grab just the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)

sed 's/<[^<]*>//g': Remove everything inside tags

grep -v '^[[:space:]]*$': Remove blank lines

There is likely a more efficient way to do this, but it seems to work for me on the few docs I've tested it with.

As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform, despite being a bit of an ugly hack ;)

score 4 · Answer 9 · answered Nov 12 '09 at 16:18

If your intention is to use pure Python modules without calling a subprocess, you can use the zipfile module from the standard library.

import zipfile

content = ""
# Load DocX into zipfile
docx = zipfile.ZipFile('/home/whateverdocument.docx')
# Unpack zipfile
unpacked = docx.infolist()
# Find the word/document.xml file in the package and assign it to variable
for item in unpacked:
    if item.orig_filename == 'word/document.xml':
        content = docx.read(item.orig_filename)

    else:
        pass

Your content string however needs to be cleaned up, one way of doing this is:

# Clean the content string from xml tags for better search
if isinstance(content, bytes):
    # ZipFile.read() returns bytes on Python 3; decode before splitting on str
    content = content.decode('utf-8')
fullyclean = []
halfclean = content.split('<')
for item in halfclean:
    if '>' in item:
        bad_good = item.split('>')
        if bad_good[-1] != '':
            fullyclean.append(bad_good[-1])

# Assemble a new string with all pure content
content = " ".join(fullyclean)

But there is surely a more elegant way to clean up the string, probably using the re module. Hope this helps.

To unescape XML entities in the text: `from xml.sax.saxutils import unescape; text = unescape(content)` – Jesvin Jose, Aug 01 '11 at 08:05
Using re module, the cleaning can be a lot easier: `stripped_content = re.compile(b'<.*?>').sub(b' ', content ) # strip tags` One thing I could not understand in your code is, in the former snippet why aren't you `break`ing out inside the `if` block? — Vikas Prasad, Sep 04 '15 at 18:13

Antoine Dusséaux · Answer 10 · 2016-08-05T07:28:31.260

To read Word 2007 and later files, including .docx files, you can use the python-docx package:

from docx import Document
document = Document('existing-document-file.docx')
document.save('new-file-name.docx')

To read .doc files from Word 2003 and earlier, make a subprocess call to antiword. You need to install antiword first:

sudo apt-get install antiword

Then just call it from your python script:

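import subprocess
input_word_file = "input_file.doc"
output_text_file = "output_file.txt"
# Pass the arguments as a list so paths with spaces or shell characters are safe
with open(output_text_file, "wb") as out:
    subprocess.run(["antiword", input_word_file], stdout=out, check=True)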

score 3 · Answer 11 · edited Jun 21 '16 at 06:32

I'm not sure if you're going to have much luck without using COM. The .doc format is ridiculously complex, and is often called a "memory dump" of Word at the time of saving!

@Swati, that's in HTML, which is fine and dandy, but most Word documents aren't so nice!

edited Jun 21 '16 at 06:32 by Steve Barnes
answered Sep 24 '08 at 03:19 by William Keller

score 3 · Answer 12 · edited May 23 '17 at 12:10

If you have LibreOffice installed, you can simply call it from the command line to convert the file to text, then load the text into Python.
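
A rough sketch of that approach, assuming the LibreOffice binary is available as `soffice` (on some systems it is called `libreoffice`) and using its `--headless`, `--convert-to` and `--outdir` options. The function names are hypothetical, and reading the result as UTF-8 is an assumption about the output encoding.

```python
import subprocess
from pathlib import Path


def build_convert_command(src, outdir, binary="soffice"):
    """Build the LibreOffice command line that converts `src` to plain text."""
    return [binary, "--headless", "--convert-to", "txt:Text",
            "--outdir", str(outdir), str(src)]


def office_to_text(src, outdir, binary="soffice"):
    """Convert `src` with LibreOffice and return the resulting text."""
    subprocess.run(build_convert_command(src, outdir, binary), check=True)
    # LibreOffice names the output after the input, with a .txt extension.
    converted = Path(outdir) / (Path(src).stem + ".txt")
    # Assumption: the output is UTF-8; "utf-8-sig" also drops a leading BOM.
    return converted.read_text(encoding="utf-8-sig", errors="replace")
```

Keeping the command construction in its own function makes it easy to inspect or log the exact command before running it.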

edited May 23 '17 at 12:10 by Community
answered May 08 '15 at 11:31 by markling

Ah Philip! I was just looking for a way to reject the trivial edits of style you made to another post of mine. I tried to contact you directly. Would you please state more clearly what you are suggesting here? This answer I gave here is in answer to the question. Isn't that good enough? – markling May 08 '15 at 14:55
Re. your edits of style and grammar: I preferred my own style and grammar, thank you. A good editor doesn't impose his own style. And really, none of us have enough spare time to be doing trivial spell and grammer checking, do we? I think you may find it is a little over-bearing. – markling May 08 '15 at 15:46

Dalen · Answer 13 · 2015-06-01T21:07:18.500

Is this an old question? I believe there is no such thing. There are only answered and unanswered ones. This one is pretty much unanswered, or half answered if you wish. Well, methods for reading *.docx (MS Word 2007 and later) documents without using COM interop are all covered. But methods for extracting text from *.doc (MS Word 97-2000) using only Python are lacking. Is this complicated? To do: not really. To understand: well, that's another thing.

When I didn't find any finished code, I read some format specifications and dug out some proposed algorithms in other languages.

An MS Word (*.doc) file is an OLE2 compound file. Without bothering you with a lot of unnecessary details, think of it as a file system stored in a file. It actually uses a FAT structure, so the definition holds. (Hm, maybe you can loop-mount it in Linux???) In this way, you can store more files within a file, like pictures etc. The same is done in *.docx by using a ZIP archive instead. There are packages available on PyPI that can read OLE files, such as olefile and compoundfiles. I used the compoundfiles package to open the *.doc file. However, in MS Word 97-2000, the internal subfiles are not XML or HTML, but binary files. And as if this were not enough, each contains information about the other one, so you have to read at least two of them and unravel the stored info accordingly. To understand it fully, read the PDF document from which I took the algorithm.
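
Since the right extraction method depends on which container a file uses, a quick look at its leading bytes can tell the two apart. This is a sketch using only the standard library; the function name `sniff_word_format` is hypothetical. It only identifies the container, so an "ole2" result may also be an old Excel or PowerPoint file.

```python
import zipfile

# Signature at the start of every OLE2 compound file (.doc, .xls, .ppt, ...)
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def sniff_word_format(path):
    """Return 'ole2', 'docx', or 'unknown' for the file at `path`."""
    with open(path, "rb") as fh:
        header = fh.read(8)
    if header == OLE2_MAGIC:
        return "ole2"
    # A .docx is a ZIP archive that contains word/document.xml.
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            if "word/document.xml" in archive.namelist():
                return "docx"
    return "unknown"
```

A caller could use the result to choose between a ZIP-based reader for *.docx and an OLE-based one for *.doc, regardless of the file extension.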

The code below was composed very hastily and tested on a small number of files. As far as I can see, it works as intended. Sometimes some gibberish appears at the start, and almost always at the end of the text. And there can be some odd characters in between as well.

Those of you who just wish to search for text will be happy. Still, I urge anyone who can help to improve this code to do so.


doc2text module:
"""
This is Python implementation of C# algorithm proposed in:
http://b2xtranslator.sourceforge.net/howtos/How_to_retrieve_text_from_a_binary_doc_file.pdf

Python implementation author is Dalen Bernaca.
Code needs refining and probably bug fixing!
As I am not a C# expert I would like some code rechecks by one.
Parts of which I am uncertain are:
    * Did the author of the original algorithm use uint32 and int32 correctly when unpacking?
      I copied each occurrence as in the original algo.
    * Is the FIB length for MS Word 97 1472 bytes as in MS Word 2000, and would it make any difference if it is not?
    * Did I interpret each C# command correctly?
      I think I did!
"""

from compoundfiles import CompoundFileReader, CompoundFileError
from struct import unpack

__all__ = ["doc2text"]

def doc2text(path):
    text = ""
    cr = CompoundFileReader(path)
    # Load WordDocument stream:
    try:
        f = cr.open("WordDocument")
        doc = f.read()
        f.close()
    except Exception:
        cr.close()
        raise CompoundFileError("The file is corrupted or it is not a Word document at all.")
    # Extract file information block and piece table stream information from it.
    # "<L" and "<l" read 4-byte little-endian values; a bare "L" is 8 bytes on 64-bit Linux.
    fib = doc[:1472]
    fcClx  = unpack("<L", fib[0x01a2:0x01a6])[0]
    lcbClx = unpack("<L", fib[0x01a6:0x01a6+4])[0]
    tableFlag = unpack("<L", fib[0x000a:0x000a+4])[0] & 0x0200 == 0x0200
    tableName = ("0Table", "1Table")[tableFlag]
    # Load piece table stream:
    try:
        f = cr.open(tableName)
        table = f.read()
        f.close()
    except Exception:
        cr.close()
        raise CompoundFileError("The file is corrupt. '%s' piece table stream is missing." % tableName)
    cr.close()
    # Find piece table inside a table stream:
    clx = table[fcClx:fcClx+lcbClx]
    pos = 0
    pieceTable = b""
    lcbPieceTable = 0
    while True:
        # Slicing yields a bytes object, so it can be compared with b"\x02"
        if clx[pos:pos+1] == b"\x02":
            # This is piece table, we store it:
            lcbPieceTable = unpack("<l", clx[pos+1:pos+5])[0]
            pieceTable = clx[pos+5:pos+5+lcbPieceTable]
            break
        elif clx[pos:pos+1] == b"\x01":
            # This is the beginning of some other substructure, we skip it:
            pos = pos+1+1+clx[pos+1]
        else: break
    if not pieceTable: raise CompoundFileError("The file is corrupt. Cannot locate a piece table.")
    # Read info from pieceTable about each piece and extract it from WordDocument stream:
    pieceCount = (lcbPieceTable-4)//12
    for x in range(pieceCount):
        cpStart = unpack("<l", pieceTable[x*4:x*4+4])[0]
        cpEnd   = unpack("<l", pieceTable[(x+1)*4:(x+1)*4+4])[0]
        ofsetDescriptor = ((pieceCount+1)*4)+(x*8)
        pieceDescriptor = pieceTable[ofsetDescriptor:ofsetDescriptor+8]
        fcValue = unpack("<L", pieceDescriptor[2:6])[0]
        isANSII = (fcValue & 0x40000000) == 0x40000000
        fc      = fcValue & 0xbfffffff
        cb = cpEnd-cpStart
        enc = ("utf-16", "cp1252")[isANSII]
        cb = (cb*2, cb)[isANSII]
        text += doc[fc:fc+cb].decode(enc, "ignore")
    return "\n".join(text.splitlines())

score 2 · Answer 14 · answered Feb 12 '13 at 09:25

Just an option for reading 'doc' files without using COM: miette. Should work on any platform.

answered Feb 12 '13 at 09:25 by alecxe

score 0 · Answer 15 · answered Jan 11 '21 at 16:27

Aspose.Words Cloud SDK for Python is a platform-independent solution for converting MS Word/OpenOffice files to text. It is a commercial product, but the free trial plan provides 150 monthly API calls.

P.S: I am a developer evangelist at Aspose.

# For complete examples and data files, please go to https://github.com/aspose-words-cloud/aspose-words-cloud-python
# Import module
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile

# Please get your Client ID and Secret from https://dashboard.aspose.cloud.
client_id='xxxxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxx'
client_secret='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

words_api = asposewordscloud.WordsApi(client_id,client_secret)
words_api.api_client.configuration.host='https://api.aspose.cloud'

filename = 'C:/Temp/02_pages.docx'
dest_name = 'C:/Temp/02_pages.txt'
# Convert the document to text
request = asposewordscloud.models.requests.ConvertDocumentRequest(document=open(filename, 'rb'), format='txt')
result = words_api.convert_document(request)
copyfile(result, dest_name)
