49

I have a test for a job application in which I need to read some .doc files. Does anyone know a library for this? I started with plain Python code:

f = open('test.doc', 'r')
f.read()

But this does not return a readable string; I need to convert it to UTF-8.

Edit: I just want to get the text from this file.
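
Some background on why the plain read fails: a legacy .doc file is a binary container, not text, so reading it in text mode yields garbage or a decoding error. Files named .doc can also turn out to be RTF, HTML, or even .docx in disguise. A minimal sketch that guesses the real format from the first bytes (the helper name `sniff_word_format` and its return labels are hypothetical, chosen only for illustration):

```python
# Leading "magic" bytes of formats that commonly hide behind a .doc name.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 container (binary .doc)
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP archive (.docx is one)
RTF_MAGIC = b"{\\rtf"                            # Rich Text Format


def sniff_word_format(path):
    """Return a rough guess of the real format of the file at `path`."""
    with open(path, "rb") as stream:
        head = stream.read(512)
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.startswith(RTF_MAGIC):
        return "rtf"
    if b"<html" in head.lower():
        return "html"
    return "unknown"
```

An "ole2" result only says the file is an OLE2 container (old Excel and PowerPoint files use the same container), so it is a hint about which extraction tool to try, not a guarantee.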

Billal Begueradj
Italo Lemos
  • 2
    Follow the installation instructions given here. https://github.com/btimby/fulltext Before importing the module, don't forget to do 'pip install fulltext' – Siddharth Kanojiya Jul 26 '18 at 20:46

10 Answers

53

One can use the textract library. It takes care of both "doc" and "docx" files.

import textract
text = textract.process("path/to/file.extension")

You can also use antiword directly (sudo apt-get install antiword). It prints the text of a .doc file to standard output, so redirect it to a plain-text file:

antiword filename.doc > filename.txt

Under the hood, textract uses antiword for .doc files.

Shivam Kotwalia
  • 5
    Antiword does not seem to work on 64-bit Windows; any idea about that? – bones.felipe Jun 14 '17 at 17:28
  • 1
    @bones.felipe Yes! Antiword is a Linux-based command-line tool. If you are on Windows 10 with the Anniversary Update, I recommend using Bash on Ubuntu on Windows [1] so you can work with Unix commands on Windows. [1] http://www.windowscentral.com/how-install-bash-shell-command-line-windows-10 – Shivam Kotwalia Jun 14 '17 at 19:38
  • 1
    I'm late, but Antiword also has a [Windows version](http://www-stud.rbi.informatik.uni-frankfurt.de/~markus/antiword/). There is also [catdoc](https://www.wagner.pp.ru/~vitus/software/catdoc/), but it has a DOS version and does not support long filenames. – Michael Jan 12 '18 at 21:23
  • @MichaelO. Thanks! Never knew about the Windows version. Thanks again :) – Shivam Kotwalia Jan 13 '18 at 06:48
  • @ShivamKotwalia Is there any way to read the header and footer content of a .doc file? – Kiran Kumar Kotari Apr 01 '18 at 09:50
  • You can also use `catdoc filename.doc > filename.docx` which is pre-installed on Ubuntu and perhaps on other distributions too. – Robin Dinse Jul 28 '18 at 13:49
  • I have worked with textract, and as far as I know it does not require antiword. textract can also convert .doc files (but the document should not be older than 2003 or in the 2003 format). – yunus Nov 09 '18 at 06:02
  • 1
    @yunus - I might be wrong, but please have a look for "doc" in this currently-supported section, https://github.com/deanmalmgren/textract/blob/05fdc7a08dc3fc52eb519aefac4fcbec8981dd8e/docs/index.rst#currently-supporting – Shivam Kotwalia Nov 19 '18 at 03:36
  • 1
    Yes, it is supported, but please have a look at the issues too (on GitHub itself). To be specific, it cannot process .doc files older than the 2003 format (as mentioned in my previous comment). – yunus Nov 19 '18 at 10:07
  • Not sure if I'm the only one, but converting a doc to docx with antiword in this way does not work. There appears to be a `-x` flag for outputting XML, but that doesn't seem to be supported for doc files. – Ben Jul 09 '19 at 17:17
  • I strongly suggest [WSL](https://learn.microsoft.com/en-us/windows/wsl/install) for Windows users. – armanexplorer Feb 28 '23 at 22:01
35

You can use the python-docx2txt library to read text from Microsoft Word documents. It improves on the python-docx library in that it can also extract text from links, headers, and footers. It can even extract images.

You can install it by running: pip install docx2txt.

Let's read a sample Microsoft Word document:

import docx2txt
my_text = docx2txt.process("test.docx")
print(my_text)

EDIT:

This does NOT work for .doc files. The only reason I am keeping this answer is that some people seem to find it useful for .docx files.
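
As background on why .docx is so much easier than .doc: a .docx file is a ZIP archive whose main body lives in `word/document.xml`. A rough standard-library sketch of the idea follows. It is not how docx2txt is implemented, the function name `docx_body_text` is hypothetical, and it ignores headers, footers, and images:

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_body_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling `docx_body_text("test.docx")` returns the paragraphs joined by newlines. A binary .doc file is not a ZIP archive, so `zipfile.ZipFile` raises `BadZipFile` for it, which matches the limitation noted in the edit above.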

Billal Begueradj
27

I was trying to do the same and found lots of information on reading .docx but much less on .doc. Anyway, I managed to read the text using the following:

import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.visible = False
wb = word.Documents.Open("myfile.doc")
doc = word.ActiveDocument
print(doc.Range().Text)

Edit:

To close everything completely, it is better to append this:

# close the document
doc.Close(False)

# quit Word
word.Quit()

Also, note that you should pass an absolute path for your .doc file, not a relative one. Use this to get the absolute path:

import os

# for example, ``rel_path`` could be './myfile.doc'
full_path = os.path.abspath(rel_path)
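
Putting the pieces above together, here is a hedged sketch that resolves the path first and uses try/finally so that Word is closed even when something fails. The helper names `absolute_doc_path` and `read_doc_with_word` are hypothetical; the COM calls are the ones already used above, and the second function only works on Windows with Word and pywin32 installed:

```python
from pathlib import Path


def absolute_doc_path(path):
    """Resolve `path` to an absolute path and fail early if it is missing."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"No such file: {resolved}")
    return str(resolved)


def read_doc_with_word(path):
    """Extract text from a .doc via Word automation (Windows with Word only)."""
    import win32com.client  # pywin32; imported lazily so this module loads anywhere

    full_path = absolute_doc_path(path)
    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        doc = word.Documents.Open(full_path)
        try:
            return doc.Range().Text
        finally:
            doc.Close(False)  # close without saving changes
    finally:
        word.Quit()  # always shut Word down, even if opening failed
```

Checking the path before starting Word turns a typo in the filename into an ordinary `FileNotFoundError` instead of a COM error.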
armanexplorer
10SecTom
  • 4
    Upvote. This is the only native solution to work with anaconda3, no extra installs. Can this be done for pure `.ppt` files as well? I tried `word = win32com.client.Dispatch("PowerPoint.Application")` but got some errors. – bmc Oct 24 '18 at 12:24
  • 1
    Yes, it is Windows only. – 10SecTom Mar 23 '19 at 08:21
  • This didn't seem to work great for me. It only retrieved some of the text and couldn't read the file unless it had a very simple filepath (e.g. dashes in the filepath seemed to cause problems) – wordsforthewise Jan 31 '21 at 04:13
  • 3
    For this solution to work, the installed Word has to be able to open the document. New versions of word do not open old doc files by default. In order to make Word open them do the following in Word: `File` -> `Options` -> `Trust Center` -> `Trust Center Options` -> `File Block Settings` and then uncheck the files types you want to open – Charalamm May 08 '21 at 16:00
  • This solution did not read numbered lists, just paragraph text. – Kenney Jan 25 '22 at 18:04
  • Where is the documentation for the win32com client? – Lei Yang May 05 '23 at 07:50
12

The answer from Shivam Kotwalia works perfectly. However, the result is returned as a bytes object. Sometimes you need it as a string, for example to run a regular expression on it.

I recommend the following code (the first two lines are from Shivam Kotwalia's answer):

import textract

text = textract.process("path/to/file.extension")
text = text.decode("utf-8") 

The last line will convert the object text to a string.
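
If the bytes are not valid UTF-8, `decode("utf-8")` raises `UnicodeDecodeError`. A small sketch of a fallback chain; the helper name `bytes_to_text` and the choice of cp1252 as the second candidate are assumptions for illustration, not something textract prescribes:

```python
def bytes_to_text(raw, encodings=("utf-8", "cp1252")):
    """Decode `raw` with the first encoding in `encodings` that succeeds."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what can be decoded and mark the rest
    return raw.decode(encodings[0], errors="replace")
```

With this helper, the output of `textract.process(...)` can be passed straight to `bytes_to_text` instead of calling `.decode("utf-8")` directly.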

lucas F
7

I agree with Shivam's answer, except that textract is not available for Windows. Also, antiword fails to read some '.doc' files and gives an error:

'filename.doc' is not a word document. # This happens when the file wasn't generated via MS Office. Eg: Web-pages may be stored in .doc format offline.

So, I've got the following workaround to extract the text:

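from bs4 import BeautifulSoup as bs

# The "Word" file is really HTML, so parse it as such
with open(filename) as f:
    soup = bs(f.read(), 'html.parser')
# Remove <style> and <script> elements so only visible text remains
for s in soup(['style', 'script']):
    s.extract()
tmpText = soup.get_text()
# Drop tabs and newlines, then trim surrounding whitespace
text = "".join("".join(tmpText.split('\t')).split('\n')).strip()
print(text)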

This script works for files that are really HTML (for example, saved web pages) stored with a .doc extension. Have fun!
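
If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's `html.parser`. This is a swapped-in technique, not the code above; the class name `VisibleTextParser` and the function `html_to_text` are hypothetical:

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring the contents of <style> and <script>."""

    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside <style>/<script>
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(markup):
    """Return the visible text of an HTML string, chunks joined by spaces."""
    parser = VisibleTextParser()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)
```

Unlike the one-liner above, this joins text chunks with single spaces instead of removing all newlines, so words from adjacent elements do not run together.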

Rahul Nimbal
  • 2
    This did not work in my case because of unknown text encoding. I tried various ones also using `chardet`, but to no avail. – Robin Dinse Aug 04 '19 at 08:21
  • Please refer to this [link](https://softwareengineering.stackexchange.com/questions/187169/how-to-detect-the-encoding-of-a-file?newreg=04c5fc89c9aa49f8afe81e81147f5121) – Rahul Nimbal Aug 05 '19 at 10:01
5

Prerequisites:

Install antiword: sudo apt-get install antiword

Install docx: pip install docx

from subprocess import Popen, PIPE

# opendocx/getdocumenttext come from the legacy "docx" package;
# they are not used by the .doc code below.
# from docx import opendocx, getdocumenttext


def document_to_text(filename, file_path):
    # Run antiword on the .doc file and capture its plain-text output
    cmd = ['antiword', file_path]
    p = Popen(cmd, stdout=PIPE)
    stdout, stderr = p.communicate()
    return stdout.decode('ascii', 'ignore')


print(document_to_text('your_file_name', 'your_file_path'))

Note: newer versions of python-docx removed the opendocx and getdocumenttext functions. Make sure to pip install docx and not the newer python-docx.
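
The same antiword call can also be sketched with `subprocess.run`, checking first that the tool is installed. The function name `antiword_text` and its `tool` parameter are hypothetical, and decoding the output as UTF-8 is an assumption that depends on how antiword is configured on the machine:

```python
import shutil
import subprocess


def antiword_text(doc_path, tool="antiword"):
    """Return the plain text that `tool` prints for the .doc at `doc_path`."""
    if shutil.which(tool) is None:
        raise RuntimeError(f"{tool!r} is not installed or not on PATH")
    result = subprocess.run(
        [tool, doc_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    # Assumption: the tool emits UTF-8; undecodable bytes become U+FFFD
    return result.stdout.decode("utf-8", errors="replace")
```

Compared with decoding as ASCII and ignoring errors, this keeps accented characters when the output really is UTF-8, and it fails loudly when antiword is missing or rejects the file.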

Billal Begueradj
Aslam Shaik
3

I looked for a solution for a long time. There is not much material about .doc files; in the end I solved the problem by converting the .doc file to .docx:

import os
from win32com import client as wc

w = wc.Dispatch('Word.Application')
# Or use the following method to start a separate process:
# w = wc.DispatchEx('Word.Application')
doc = w.Documents.Open(os.path.abspath('test.doc'))
# 16 is wdFormatDocumentDefault, i.e. .docx
doc.SaveAs(os.path.abspath('test_docx.docx'), 16)
doc.Close(False)
w.Quit()
zzhapar
0

I had to do the same to search through a ton of *.doc files for a specific number and came up with:

special_chars = {
    "b'\\t'": '\t',
    "b'\\r'": '\n',
    "b'\\x07'": '|',
    "b'\\xc4'": 'Ä',
    "b'\\xe4'": 'ä',
    "b'\\xdc'": 'Ü',
    "b'\\xfc'": 'ü',
    "b'\\xd6'": 'Ö',
    "b'\\xf6'": 'ö',
    "b'\\xdf'": 'ß',
    "b'\\xa7'": '§',
    "b'\\xb0'": '°',
    "b'\\x82'": '‚',
    "b'\\x84'": '„',
    "b'\\x91'": '‘',
    "b'\\x93'": '“',
    "b'\\x96'": '-',
    "b'\\xb4'": '´'
}


def get_string(path):
    string = ''
    with open(path, 'rb') as stream:
        stream.seek(2560) # Offset - text starts after byte 2560
        current_stream = stream.read(1)
        while not (str(current_stream) == "b'\\xfa'"):
            if str(current_stream) in special_chars.keys():
                string += special_chars[str(current_stream)]
            else:
                try:
                    char = current_stream.decode('UTF-8')
                    if char.isalnum():
                        string += char
                except UnicodeDecodeError:
                    string += ''
            current_stream = stream.read(1)
    return string

I'm not sure how 'clean' this solution is, but it works well with regex.
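
Most entries in the `special_chars` table above match the Windows-1252 code page, so under the assumption that the file stores its text as 8-bit cp1252 (many Word files store it as 16-bit Unicode instead, in which case neither approach applies), the table can be replaced by a single decode call. A minimal sketch with a hypothetical helper name:

```python
def decode_word_bytes(raw):
    """Decode 8-bit Word text, assuming the Windows-1252 code page."""
    text = raw.decode("cp1252", errors="ignore")
    # As in the table above: \r ends a paragraph, \x07 ends a table cell
    return text.replace("\r", "\n").replace("\x07", "|")
```

Unlike `get_string`, this keeps spaces and punctuation rather than only alphanumeric characters, and it maps byte 0x96 to an en dash instead of a hyphen. Finding where the text starts and ends in the file is still left to the caller.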

Viktor
0

This code works if you are looking for a way to read a .doc file in Python. Install all the related packages first, then run it and check the result.

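import os
from io import BytesIO

import pythoncom
import requests
import win32com.client as win32

# `doc_file` and `request` come from the surrounding web request handler (not shown)
if doc_file:
    # Download the attachment and save it to a local .doc file
    _file = requests.get(request.values['MediaUrl0'])
    doc_file_link = BytesIO(_file.content)
    file_path = os.path.join(os.getcwd(), 'data.doc')
    with open(file_path, 'wb') as E:
        E.write(doc_file_link.getbuffer())

    # Initialise COM for this thread, then let Word open the file
    pythoncom.CoInitialize()
    word = win32.gencache.EnsureDispatch('Word.Application')
    doc = word.Documents.Open(file_path)
    doc.Activate()
    doc_data = doc.Range().Text
    print(doc_data)
    doc.Close(False)

    # Remove the temporary file
    if os.path.exists(file_path):
        os.remove(file_path)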
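
The save-then-delete part of the code above can be sketched more defensively with `tempfile` and try/finally, so the file is removed even if reading fails. Both helper names are hypothetical, and `reader` stands for any function that takes a file path, such as a Word-automation routine:

```python
import os
import tempfile


def save_temp_doc(data, suffix=".doc"):
    """Write `data` (bytes) to a new temporary file and return its path."""
    handle, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(handle, "wb") as stream:
        stream.write(data)
    return path


def with_temp_doc(data, reader):
    """Call `reader(path)` on a temporary copy of `data`, then delete it."""
    path = save_temp_doc(data)
    try:
        return reader(path)
    finally:
        os.remove(path)
```

`mkstemp` returns an absolute path with a unique name, which avoids clashes when several requests are handled at once. The reader must close the document before returning, otherwise deleting the file can fail on Windows.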
-1

!pip install python-docx

import docx

# Open the Word file in binary mode (python-docx reads .docx, not .doc)
doc = open("file.docx", "rb")

# Create the document object from the open file
document = docx.Document(doc)
General Grievance