I'm working on a project in which I need to read the text from multiple .doc and .docx files. The .docx files were easy to handle with the docx2txt module, but I cannot for the life of me make it work for .doc files. I've tried textract, but it doesn't seem to work on Windows. I just need the text in the file, no pictures or anything like that. Any ideas?
- Does this answer your question? [Read .doc file with python](https://stackoverflow.com/questions/36001482/read-doc-file-with-python) – user202729 Jun 09 '20 at 12:10
- It is not easy to do. `textract` can do it if you have antiword installed. Tika can extract the text, but not the formatting. – erip Jun 09 '20 at 12:10
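For anyone who wants to try the antiword route mentioned in the comment without going through `textract`, here is a minimal sketch that calls the `antiword` command-line tool directly. It assumes antiword is installed and on `PATH` (which may take extra work on Windows), and that its output can be decoded as UTF-8; undecodable bytes are replaced rather than raising. The function name `doc_to_text_antiword` and its `executable` parameter are made up for this example.

```python
import shutil
import subprocess


def doc_to_text_antiword(path, executable="antiword"):
    """Return the text of a .doc file by running the antiword CLI on it."""
    exe = shutil.which(executable)
    if exe is None:
        raise FileNotFoundError(f"{executable!r} was not found on PATH")
    # antiword prints the plain text of the document to standard output.
    result = subprocess.run([exe, str(path)], capture_output=True, check=True)
    # Assumption: the output is UTF-8; bytes that are not get replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `doc_to_text_antiword("myfile.doc")` should return the document text if antiword can read the file. A missing executable raises `FileNotFoundError` up front instead of failing inside `subprocess`.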
2 Answers
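I found that this seems to work:

    import os
    import win32com.client  # needs pywin32 and an installed copy of Microsoft Word
    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    # Word resolves relative paths against its own working directory, so pass an absolute path.
    doc = word.Documents.Open(os.path.abspath("myfile.doc"))
    try:
        print(doc.Range().Text)
    finally:
        doc.Close(False)  # close without saving changes
        word.Quit()       # do not leave a hidden Word process running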
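Since the question is about multiple files, the same idea can be stretched to a whole folder. This is only a sketch under the same assumptions (Windows, pywin32, Word installed); `find_word_files` and `read_all_with_word` are hypothetical helper names, not part of any library. Word opens both .doc and .docx, so one hidden instance is reused for every file.

```python
from pathlib import Path


def find_word_files(folder):
    """Return the .doc and .docx files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file()
        and p.suffix.lower() in {".doc", ".docx"}
        and not p.name.startswith("~$")  # skip the lock files Word creates
    )


def read_all_with_word(folder):
    """Map each file name to its text, reusing one hidden Word instance."""
    import win32com.client  # imported here so find_word_files works without pywin32

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    texts = {}
    try:
        for path in find_word_files(folder):
            # Word expects an absolute path.
            doc = word.Documents.Open(str(path.resolve()))
            try:
                texts[path.name] = doc.Range().Text
            finally:
                doc.Close(False)  # close without saving changes
    finally:
        word.Quit()
    return texts
```

Starting Word is the slow part, so opening it once and looping over the documents should be noticeably faster than dispatching a new instance per file.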

Kehinde
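I had a similar issue; the following function worked for me.

    from pathlib import Path

    # Assumed example mapping: `special_chars` was not shown in this answer.
    # In Word's binary text, b'\r' ends a paragraph and b'\x07' ends a table cell.
    special_chars = {b'\r': '\n', b'\x07': '|'}

    def get_string(path: Path) -> str:
        string = ''
        with open(path, 'rb') as stream:
            # 2560 is where the text started in the file I tested;
            # other .doc files may store it at a different offset.
            stream.seek(2560)
            current = stream.read(1)
            # Stop at the first NUL byte or at end of file.
            while current and current != b'\x00':
                if current in special_chars:
                    string += special_chars[current]
                else:
                    try:
                        char = current.decode('UTF-8')
                        if char.isalnum() or char == ' ':
                            string += char
                    except UnicodeDecodeError:
                        pass  # skip bytes that are not valid on their own
                current = stream.read(1)
        return string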
I tested it on a .doc file looking like the following: [picture of .doc file]

The output from:

    string = get_string(filepath)
    print(string)

is:

    The big red fox jumped over the small barrier to get to the chickens on the other side
    And the chickens ran about but had no luck in surviving the day
    this||||that||||The other||||
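Because the fixed-offset approach above only makes sense for the old binary format, it may be worth checking the file first. Binary .doc files are OLE2 compound files, which start with the eight bytes `D0 CF 11 E0 A1 B1 1A E1`. The sketch below only checks that signature; `looks_like_binary_doc` is a name invented for this example. The same signature is also used by .xls, .ppt and other OLE2 files, so treat a match as necessary rather than sufficient.

```python
# Signature at the start of every OLE2 compound file,
# the container format used by binary .doc files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_binary_doc(path):
    """Return True if the file starts with the OLE2 compound-file signature."""
    with open(path, "rb") as stream:
        return stream.read(len(OLE2_MAGIC)) == OLE2_MAGIC
```

A .docx file renamed to .doc is a ZIP archive and fails this check, which is a useful hint to hand it to docx2txt instead.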

Bemofresh