I'm building a program that searches files in many different formats for specified keywords, and I'm having trouble finding information on how to read .doc files. I got .docx files working with the following function:
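import logging
import os
import re
import xml.dom.minidom
import zipfile
from xml.parsers.expat import ExpatError

logger = logging.getLogger(__name__)

def docx_file(root: str, file: str, keywords: list) -> list:
    """Return the keywords found in the main XML part of a .docx file."""
    path = os.path.join(root, file)
    hits = []
    try:
        # A .docx file is a ZIP archive; the body text lives in word/document.xml.
        with zipfile.ZipFile(path) as document:
            docXml = xml.dom.minidom.parseString(document.read('word/document.xml')).toprettyxml()
    except (OSError, zipfile.BadZipFile, KeyError, ExpatError):
        logger.warning(f"failed to open {path}")
        return []
    for keyword in keywords:
        # Each keyword is treated as a regular expression by re.search.
        if re.search(keyword, docXml):
            hits.append(keyword)
    return hits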
Unfortunately, .doc files don't work the same way. I'm looking for a lightweight solution, and would really prefer not to import additional libraries for this functionality.
Thank you in advance for your assistance.
I took a look through the output of the following:
document = zipfile.ZipFile(os.path.join(root,file))
document.filelist
I then ran:
for doc in document.filelist:
    docXml = xml.dom.minidom.parseString(document.read(doc.filename)).toprettyxml(indent=" ")
    print(docXml)
Based on the output, I don't think the .docx solution will work with .doc files.
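One way to confirm which container a given file actually uses is to look at its first bytes. A .docx file is a ZIP archive, whereas a legacy .doc file is normally an OLE2 compound file that begins with the signature D0 CF 11 E0 A1 B1 1A E1, so zipfile cannot open it. The sketch below uses only the standard library; the function name sniff_word_container is hypothetical and chosen just for illustration.

```python
import zipfile

# Magic number at the start of every OLE2 compound file (legacy .doc, .xls, .ppt).
OLE2_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")

def sniff_word_container(path: str) -> str:
    """Return 'ole2', 'zip', or 'unknown' based on the file's contents.

    Hypothetical helper: it identifies the container only, not whether
    the file is really a Word document.
    """
    with open(path, "rb") as fh:
        header = fh.read(len(OLE2_SIGNATURE))
    if header == OLE2_SIGNATURE:
        return "ole2"
    if zipfile.is_zipfile(path):
        return "zip"
    return "unknown"
```

A caller could route 'zip' results to docx_file and handle 'ole2' results separately, instead of relying on the file extension.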
Edit: I'm also looking into whether I can use the output of the following:
document = open(os.path.join(root,file), 'rb')
for line in document.readlines():
    print(line)
I've tried decoding the output, which just gives:
☻☺àùOh«
+'³Ù0l☺◄☺↑Microsoft Office Word@@€ћs¶юШ☺@€ћs¶юШ☺♥☻♥♥.♥юя
for any encoding I've tried so far.
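The garbage is consistent with .doc being a binary container rather than text in a single encoding, so decoding the whole file is unlikely to succeed. A rough stdlib-only alternative is to search the raw bytes for each keyword in the encodings Word 97-2003 typically uses for document text: an 8-bit form (usually cp1252) or UTF-16LE. This is a heuristic sketch, not a parser. The function name doc_raw_hits is hypothetical, matching is literal and case-sensitive rather than regex-based, and it can miss text that is stored in fragments or report hits from metadata.

```python
def doc_raw_hits(path: str, keywords: list) -> list:
    """Return the keywords whose encoded bytes appear anywhere in the file.

    Hypothetical helper: a byte-level heuristic, assuming the text is
    stored as cp1252 or UTF-16LE somewhere in the file.
    """
    with open(path, "rb") as fh:
        data = fh.read()
    hits = []
    for keyword in keywords:
        candidates = []
        for encoding in ("cp1252", "utf-16-le"):
            try:
                candidates.append(keyword.encode(encoding))
            except UnicodeEncodeError:
                # Keyword has characters this encoding cannot represent.
                pass
        if any(candidate in data for candidate in candidates):
            hits.append(keyword)
    return hits
```

Called as doc_raw_hits(os.path.join(root, file), keywords), it returns a list in the same shape as docx_file, so the two could share the surrounding search loop.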