Read .doc file with python

Question

I have a test for a job application, and my task is to read some .doc files. Does anyone know a library that can do this? I started with plain Python code:

f = open('test.doc', 'r')
f.read()

but this does not return a readable string. I need to convert it to UTF-8.

Edit: I just want to get the text from this file.

Follow the installation instructions given here. https://github.com/btimby/fulltext Before importing the module, don't forget to do 'pip install fulltext' — Siddharth Kanojiya, Jul 26 '18 at 20:46

Shivam Kotwalia · Answer 1 · 2017-04-13T16:51:38.690

53

One can use the textract library. It takes care of both .doc and .docx files.

import textract
text = textract.process("path/to/file.extension")

You can also use antiword (sudo apt-get install antiword): convert the .doc into a .docx first and then read it with docx2txt.

antiword filename.doc > filename.docx

Under the hood, textract uses antiword to read .doc files.
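
If you only need the text and antiword is installed, calling it directly is one option. The following is a minimal sketch, not textract's actual implementation: the function name doc_to_text and its tool parameter are made up for illustration, and it assumes antiword's output can be decoded as UTF-8 (bytes that cannot are replaced).

```python
import shutil
import subprocess


def doc_to_text(path, tool="antiword"):
    """Return the text that `tool` prints for the .doc file at `path`."""
    # Fail early with a clear error if the command-line tool is not installed.
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} was not found on PATH")
    # antiword writes the extracted text to standard output.
    result = subprocess.run([tool, path], stdout=subprocess.PIPE, check=True)
    # Assumption: the output is UTF-8; bytes that are not get replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Usage would look like text = doc_to_text("path/to/file.doc"). A non-zero exit status from the tool raises subprocess.CalledProcessError because of check=True.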

edited Apr 13 '17 at 16:51

answered Mar 31 '17 at 08:18

Shivam Kotwalia


5

Antiword does not seem to work on 64-bit Windows, any idea about that? – bones.felipe Jun 14 '17 at 17:28
1

@bones.felipe Yes! Antiword is a Linux-based command line tool. If you are on Windows 10 with the Anniversary Update, I recommend using Bash on Ubuntu on Windows [1], so you can work with Unix commands on Windows happily! [1] http://www.windowscentral.com/how-install-bash-shell-command-line-windows-10 – Shivam Kotwalia Jun 14 '17 at 19:38
1

I'm late, but Antiword also has a [Windows version](http://www-stud.rbi.informatik.uni-frankfurt.de/~markus/antiword/). There is also [catdoc](https://www.wagner.pp.ru/~vitus/software/catdoc/), but it only has a DOS version and does not support long filenames. – Michael Jan 12 '18 at 21:23
@MichaelO. Thanks! Never knew about the Windows version. Thanks again :) – Shivam Kotwalia Jan 13 '18 at 06:48
@ShivamKotwalia is there any way to read .doc file header and footer content. – Kiran Kumar Kotari Apr 01 '18 at 09:50
You can also use `catdoc filename.doc > filename.docx` which is pre-installed on Ubuntu and perhaps on other distributions too. – Robin Dinse Jul 28 '18 at 13:49
I have worked with textract and, as far as I know, it doesn't require antiword. textract can convert .doc files too (but the document shouldn't be older than 2003 or in 2003 format). – yunus Nov 09 '18 at 06:02
1

@yunus - I might be wrong, but please have a look for "doc" in this currently-supported section, https://github.com/deanmalmgren/textract/blob/05fdc7a08dc3fc52eb519aefac4fcbec8981dd8e/docs/index.rst#currently-supporting – Shivam Kotwalia Nov 19 '18 at 03:36
1

Yes, it is supported, but please have a look at the issues too (on GitHub itself). To be specific, it cannot process .doc files older than the 2003 format (as mentioned in my previous comment). – yunus Nov 19 '18 at 10:07
Not sure if I'm the only one, but converting a doc to docx with antiword in this way does not work. There appears to be a `-x` flag for outputting XML, but that doesn't seem to be supported for doc files. – Ben Jul 09 '19 at 17:17
I strongly suggest [WSL](https://learn.microsoft.com/en-us/windows/wsl/install) for Windows users. – armanexplorer Feb 28 '23 at 22:01

Billal Begueradj · Answer 2 · 2018-07-23T07:17:36.427

35

You can use python-docx2txt library to read text from Microsoft Word documents. It is an improvement over python-docx library as it can, in addition, extract text from links, headers and footers. It can even extract images.

You can install it by running: pip install docx2txt.

Let's download and read the first Microsoft document on here:

import docx2txt
my_text = docx2txt.process("test.docx")
print(my_text)

EDIT:

This does NOT work for .doc files. The only reason I am keeping this answer is that some people seem to find it useful for .docx files.

edited Jul 23 '18 at 07:17

answered Mar 15 '16 at 07:04

Billal Begueradj


27

Unfortunately only .docx files are read by docx2txt I only have .doc files – Italo Lemos Mar 15 '16 at 11:38
10

Question is about reading .doc files. This works only for .docx @billal-begueradj – Kiran Kumar Kotari Apr 01 '18 at 09:52
4

You are right, it does not work for .doc files. The only reason I am keeping this answer is that some people seem to find it useful for .docx files. @HarishMashetty – Billal Begueradj Jul 23 '18 at 06:23
This does not provide the answer to the question, just not that format. – Audrius Meškauskas Nov 28 '19 at 16:46
1

@h22 thanks, but there are comments like that which dates back to 1 year ago, and I responded to them by editing the post – Billal Begueradj Nov 30 '19 at 13:41
1

You can save the file as docx using zzhapar's solution then this method will work. – Kenney Jan 25 '22 at 18:03

score 27 · Answer 3 · edited Mar 01 '23 at 11:12

27

I was trying to do the same, and I found lots of information on reading .docx but much less on .doc. Anyway, I managed to read the text using the following:

import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.visible = False
wb = word.Documents.Open("myfile.doc")
doc = word.ActiveDocument
print(doc.Range().Text)

Edit:

To close everything completely, it is better to append this:

# close the document
doc.Close(False)

# quit Word
word.Quit()

Also, note that you should use an absolute path for your .doc file, not a relative one. Use this to get the absolute path:

import os

# for example, ``rel_path`` could be './myfile.doc'
full_path = os.path.abspath(rel_path)
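
The string returned by doc.Range().Text uses Word's own control characters: paragraph marks come through as '\r', manual line breaks as '\x0b', and the end of a table cell as '\r\x07'. Below is a small sketch for turning that into ordinary text. The function name and the choice of a tab as the cell separator are assumptions for illustration, not part of the answer above.

```python
def normalize_word_text(raw):
    """Convert Word's control characters to ordinary whitespace."""
    # End-of-cell marks come through as "\r\x07" (or a bare "\x07").
    text = raw.replace("\r\x07", "\t").replace("\x07", "\t")
    # Paragraph marks are "\r"; manual line breaks are "\x0b".
    text = text.replace("\r", "\n").replace("\x0b", "\n")
    return text
```

You would call it as normalize_word_text(doc.Range().Text) before printing or running a regex over the result.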

edited Mar 01 '23 at 11:12

armanexplorer


answered Jun 11 '18 at 10:54

10SecTom


4

Upvote. This is the only native solution to work with anaconda3, no extra installs. Can this be done for pure `.ppt` files as well? I tried `word = win32com.client.Dispatch("PowerPoint.Application")` but got some errors. – bmc Oct 24 '18 at 12:24
1

Yes, it is Windows only – 10SecTom Mar 23 '19 at 08:21
This didn't seem to work great for me. It only retrieved some of the text and couldn't read the file unless it had a very simple filepath (e.g. dashes in the filepath seemed to cause problems) – wordsforthewise Jan 31 '21 at 04:13
3

For this solution to work, the installed Word has to be able to open the document. New versions of word do not open old doc files by default. In order to make Word open them do the following in Word: `File` -> `Options` -> `Trust Center` -> `Trust Center Options` -> `File Block Settings` and then uncheck the files types you want to open – Charalamm May 08 '21 at 16:00
This solution did not read numbered lists, just paragraph text. – Kenney Jan 25 '22 at 18:04
where is the win32 com client documents? – Lei Yang May 05 '23 at 07:50

score 12 · Answer 4 · answered Nov 08 '19 at 18:02

12

The answer from Shivam Kotwalia works perfectly. However, the result is returned as a bytes object. Sometimes you may need it as a string, for example to run a regex over it.

I recommend the following code (two lines from Shivam Kotwalia's answer) :

import textract

text = textract.process("path/to/file.extension")
text = text.decode("utf-8")

The last line will convert the object text to a string.
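
If you are not sure the bytes are valid UTF-8, a defensive decode avoids a UnicodeDecodeError. This is a minimal sketch: the helper name to_text and its fallback behaviour are assumptions for illustration, not something textract provides.

```python
def to_text(data, encoding="utf-8"):
    """Decode bytes to str, replacing undecodable bytes instead of raising."""
    if isinstance(data, str):
        return data  # already text, nothing to do
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        # Fall back to a lossy decode: bad bytes become U+FFFD.
        return data.decode(encoding, errors="replace")
```

With this, text = to_text(textract.process("path/to/file.extension")) always gives a str, at the cost of replacement characters where the bytes were not valid.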

answered Nov 08 '19 at 18:02

lucas F


Yeah, but i don't think the native text encoding for .doc files is UTF-8, is it? – CpILL Oct 13 '22 at 23:05

Rahul Nimbal · Answer 5 · 2019-06-14T05:59:08.683

7

I agree with Shivam's answer, except that textract isn't available for Windows. Also, antiword sometimes fails to read '.doc' files and gives an error:

'filename.doc' is not a word document. # This happens when the file wasn't generated via MS Office. Eg: Web-pages may be stored in .doc format offline.

So, I've got the following workaround to extract the text:

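from bs4 import BeautifulSoup as bs

# filename: path to a web page that was saved with a .doc extension
with open(filename) as f:
    soup = bs(f.read(), "html.parser")
# drop <style> and <script> elements so only the visible text remains
for s in soup(['style', 'script']):
    s.extract()
tmpText = soup.get_text()
# remove tabs and newlines
text = "".join("".join(tmpText.split('\t')).split('\n')).strip()
print(text)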

This script will work with most kinds of files. Have fun!
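
Before picking a tool, it can help to check what the file really is, since a ".doc" extension does not guarantee a Word binary file. The sketch below looks at the file signature: the legacy binary format lives in an OLE2 container that starts with the bytes D0 CF 11 E0 A1 B1 1A E1, while a .docx is a zip archive containing word/document.xml. The function name and return values are hypothetical, and the "doc" result is only a guess, because other legacy Office formats use the same container.

```python
import zipfile

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_format(path):
    """Guess whether `path` is a binary .doc, a .docx, or something else."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head == OLE2_MAGIC:
        return "doc"  # OLE2 container (also used by legacy .xls and .ppt)
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as z:
            if "word/document.xml" in z.namelist():
                return "docx"
    return "other"  # e.g. HTML or RTF saved with a .doc extension
```

A result of "other" for a file named something.doc is a hint that antiword will reject it and that an HTML-based approach like the one above may be the better fit.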

edited Jun 14 '19 at 05:59

answered Jun 14 '19 at 05:53

Rahul Nimbal


2

This did not work in my case because of unknown text encoding. I tried various ones also using `chardet`, but to no avail. – Robin Dinse Aug 04 '19 at 08:21
Please refer to this [link](https://softwareengineering.stackexchange.com/questions/187169/how-to-detect-the-encoding-of-a-file?newreg=04c5fc89c9aa49f8afe81e81147f5121) – Rahul Nimbal Aug 05 '19 at 10:01

score 5 · Answer 6 · edited Jul 23 '18 at 07:18

Prerequisites :

install antiword : sudo apt-get install antiword

install docx : pip install docx

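from subprocess import Popen, PIPE

# Legacy docx package only; commented out because it is unused in this snippet:
# from docx import opendocx, getdocumenttext


def document_to_text(filename, file_path):
    # filename is not used here; antiword only needs the path
    cmd = ['antiword', file_path]
    p = Popen(cmd, stdout=PIPE)
    stdout, stderr = p.communicate()
    # antiword prints the text to stdout; non-ASCII bytes are dropped
    return stdout.decode('ascii', 'ignore')

print(document_to_text('your_file_name', 'your_file_path'))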

Note: newer versions of python-docx removed the opendocx and getdocumenttext functions. If you need them, make sure to pip install docx and not the new python-docx.

score 3 · Answer 7 · answered Jul 28 '21 at 08:36

I looked for a solution for a long time. There is not much material about .doc files; in the end I solved the problem by converting the .doc to .docx:

import os
from win32com import client as wc

w = wc.Dispatch('Word.Application')
# Or use the following method to start a separate process:
# w = wc.DispatchEx('Word.Application')
doc = w.Documents.Open(os.path.abspath('test.doc'))
# 16 is wdFormatDocumentDefault, i.e. save in .docx format
doc.SaveAs(os.path.abspath('test_docx.docx'), 16)
doc.Close(False)
w.Quit()

score 0 · Answer 8 · answered Jul 07 '21 at 09:25

I had to do the same to search through a ton of *.doc files for a specific number and came up with:

special_chars = {
    "b'\\t'": '\t',
    "b'\\r'": '\n',
    "b'\\x07'": '|',
    "b'\\xc4'": 'Ä',
    "b'\\xe4'": 'ä',
    "b'\\xdc'": 'Ü',
    "b'\\xfc'": 'ü',
    "b'\\xd6'": 'Ö',
    "b'\\xf6'": 'ö',
    "b'\\xdf'": 'ß',
    "b'\\xa7'": '§',
    "b'\\xb0'": '°',
    "b'\\x82'": '‚',
    "b'\\x84'": '„',
    "b'\\x91'": '‘',
    "b'\\x93'": '“',
    "b'\\x96'": '-',
    "b'\\xb4'": '´'
}


def get_string(path):
    string = ''
    with open(path, 'rb') as stream:
        stream.seek(2560) # Offset - text starts after byte 2560
        current_stream = stream.read(1)
        # stop at the 0xfa marker, or at end of file (read returns b'')
        while current_stream and str(current_stream) != "b'\\xfa'":
            if str(current_stream) in special_chars.keys():
                string += special_chars[str(current_stream)]
            else:
                try:
                    char = current_stream.decode('UTF-8')
                    if char.isalnum():
                        string += char
                except UnicodeDecodeError:
                    string += ''
            current_stream = stream.read(1)
    return string

I'm not sure how 'clean' this solution is, but it works well with regex.
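
Most entries in the table above match the Windows-1252 code page (for example, 0xE4 is 'ä' and 0x84 is '„'), so Python's built-in cp1252 codec can do most of that mapping. A hedged sketch follows: it assumes the text really is stored as 8-bit Windows-1252, which the answer does not confirm and which will not hold for every .doc file, and the function name is hypothetical. Unlike the loop above, it keeps spaces and punctuation, and it turns 0x96 into an en dash rather than a hyphen.

```python
def decode_cp1252(raw):
    """Decode 8-bit text as Windows-1252, replacing undefined bytes."""
    # errors="replace" covers the few byte values cp1252 leaves undefined.
    text = raw.decode("cp1252", errors="replace")
    # Mirror the answer's handling of Word's control characters.
    return text.replace("\r", "\n").replace("\x07", "|")
```

It would be applied to the bytes read after the offset, for example decode_cp1252(stream.read()), instead of decoding one byte at a time.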

Interesting, specifically about the special characters. How did you figure out that list? — Partha Mandal, Jul 05 '23 at 11:56

Nishant Verma · Answer 9 · 2022-12-27T15:27:48.557

If you are looking for how to read a .doc file in Python, this code works. Install all the related packages first, then run it and check the result.

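import os
from io import BytesIO

import pythoncom
import requests
import win32com.client as win32

# doc_file and request come from the surrounding web request handler (not shown)
if doc_file:
    # download the attached .doc file
    _file = requests.get(request.values['MediaUrl0'])
    doc_file_link = BytesIO(_file.content)

    # save it to a temporary file, because Word needs a path on disk
    file_path = os.path.join(os.getcwd(), 'data.doc')
    with open(file_path, 'wb') as E:
        E.write(doc_file_link.getbuffer())

    # open the file in Word (Windows only) and read its text
    pythoncom.CoInitialize()
    word = win32.gencache.EnsureDispatch('Word.Application')
    doc = word.Documents.Open(file_path)
    doc.Activate()
    doc_data = doc.Range().Text
    print(doc_data)
    doc.Close(False)

    # remove the temporary file again
    if os.path.exists(file_path):
        os.remove(file_path)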

Which packages are expected to be installed here? What file is downloaded and why is it removed? — Chris Warrick, Dec 28 '22 at 10:50

score -1 · Answer 10 · edited Mar 02 '22 at 14:03

-1

!pip install python-docx

import docx

#Creating a word file object
doc = open("file.docx","rb")

#creating word reader object
document = docx.Document(doc)

edited Mar 02 '22 at 14:03

General Grievance


answered Feb 27 '22 at 09:19

Venkata Ramana


5

The subject in question is `.doc` extension, not `.docx` – SukiCZ Sep 06 '22 at 10:38
