Search .doc file for specific word with only standard Python libraries

Question

I'm building a program that searches through many different file formats looking for specified keywords, and I'm having trouble finding information on how to read .doc files. I was able to get .docx files to work with the following function:

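import logging
import os
import re
import xml.dom.minidom
import zipfile

logger = logging.getLogger(__name__)

def docx_file(root: str, file: str, keywords: list) -> list:
    hits = []
    path = os.path.join(root, file)
    try:
        # A .docx file is a zip archive; the body text lives in word/document.xml
        with zipfile.ZipFile(path) as document:
            doc_xml = xml.dom.minidom.parseString(document.read('word/document.xml')).toprettyxml()
        for keyword in keywords:
            # Each keyword is treated as a regular expression
            if re.search(keyword, doc_xml):
                hits.append(keyword)
    except Exception:
        logger.warning(f"failed to open {path}")
        return []

    return hits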

Unfortunately, .doc files don't work the same way. I'm looking for a lightweight solution, and would really prefer not to import additional libraries for this functionality.

Thank you in advance for your assistance.

I took a look through the output of the following:

document = zipfile.ZipFile(os.path.join(root,file))
document.filelist

I then ran:

for doc in document.filelist:
  docXml = xml.dom.minidom.parseString(document.read(doc.filename)).toprettyxml(indent="   ")
  print(docXml)

Based on the output, I don't think the .docx solution will work with .doc files.

Edit: Additionally, I'm currently looking into whether I can use the output from:

document = open(os.path.join(root,file), 'rb')
for line in document.readlines():
  print(line)

I've tried decoding the output, which just gives:

☻☺àùOh«
+'³Ù0l☺◄☺↑Microsoft Office Word@@€ћs¶юШ☺@€ћs¶юШ☺♥☻♥♥.♥юя

for any encoding I've tried so far.

Repeating the previous comment, `doc` and `docx` formats are different. I'm no expert on the subject matter, but I found the spec if you'd like to dive in for a solution: https://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Word97-2007BinaryFileFormat(doc)Specification.pdf Still, repeating the previous comment, I also recommend looking for available packages that already do that for you. — micromoses, Dec 05 '22 at 17:14

score 0 · Answer 1 · answered Dec 06 '22 at 12:36

While .docx is XML-in-a-Zip, which is pretty easy to read, .doc files (without the X) use a binary format that is much more difficult to parse, especially without a third-party library built for exactly that.
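To make the difference concrete, here is a minimal sketch that tells the two containers apart by content rather than by extension. It relies on `zipfile.is_zipfile` for .docx and on the 8-byte signature that starts every OLE2 compound file, which is the container legacy .doc files live in. The names `sniff_word_format` and `OLE2_MAGIC` are made up for this example, and note that an OLE2 container may also be an old .xls or .ppt file, so "ole2" only means "possibly a .doc".

```python
import zipfile
from pathlib import Path

# Signature at the start of every OLE2 compound file (container of legacy .doc)
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

def sniff_word_format(path: Path) -> str:
    """Return 'docx', 'ole2', or 'unknown' based on the file contents."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            # A .docx archive keeps its body text in word/document.xml
            if "word/document.xml" in archive.namelist():
                return "docx"
        return "unknown"
    with open(path, "rb") as handle:
        if handle.read(8) == OLE2_MAGIC:
            return "ole2"  # possibly a .doc, but could be another legacy Office file
    return "unknown"
```

Dispatching on the returned value lets the .docx function handle zip-based files and leaves the binary ones for a separate routine.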

You can implement part of a parser for this file format (based on the spec linked by @micromoses), but it may be tedious.

Another solution, much quicker but also much less reliable, is to just grep the file. I just tested it: I wrote my own name in plain text in a document, saved it as a .doc, and was able to grep it. That's probably because the word is stored byte-aligned in the file, so its encoded bytes can be matched directly.

Here is a quick demo:

from pathlib import Path

filepath = Path("/home/stack_overflow/parse_me.doc")
word = "Pinjon"

with open(filepath, "rb") as doc_file:
    binary_data = doc_file.read()

if word.encode("utf-16-le") in binary_data:
    print("word found (LE)")
elif word.encode("utf-16-be") in binary_data:
    print("word found (BE)")
else:
    print("word not found")

word found (LE)

My computer is little-endian, but I covered both cases (LE/BE).
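Building on the demo, here is a sketch that returns hits in the same shape as the `docx_file` function from the question. One assumption to flag: from my reading of the spec, Word can store text either as 16-bit UTF-16-LE or as 8-bit characters (cp1252 here), so both are tried. `grep_doc` and `CANDIDATE_ENCODINGS` are names invented for this example.

```python
from pathlib import Path

# Encodings to try: UTF-16-LE for 16-bit text, cp1252 for 8-bit text
# (an assumption based on the spec, not something verified on many files)
CANDIDATE_ENCODINGS = ("utf-16-le", "cp1252")

def grep_doc(path: Path, keywords: list) -> list:
    """Return the keywords whose encoded bytes appear anywhere in the file."""
    data = Path(path).read_bytes()
    hits = []
    for keyword in keywords:
        for encoding in CANDIDATE_ENCODINGS:
            try:
                needle = keyword.encode(encoding)
            except UnicodeEncodeError:
                continue  # keyword has characters this encoding cannot represent
            if needle in data:
                hits.append(keyword)
                break  # one match is enough for this keyword
    return hits
```

The match is case-sensitive and looks at the whole file, so it can also hit metadata or deleted text that is still present in the binary, and it can miss a word if formatting splits it into separate runs.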

score 0 · Answer 2 · answered Dec 08 '22 at 14:46

I believe I found a suitable solution for my problem. It's a slight variation on a snippet posted here, written by Viktor.

special_chars = {
    "b'\\t'": '\t',
    "b'\\r'": '\n',
    "b'\\x07'": '|',
    "b'\\xc4'": 'Ä',
    "b'\\xe4'": 'ä',
    "b'\\xdc'": 'Ü',
    "b'\\xfc'": 'ü',
    "b'\\xd6'": 'Ö',
    "b'\\xf6'": 'ö',
    "b'\\xdf'": 'ß',
    "b'\\xa7'": '§',
    "b'\\xb0'": '°',
    "b'\\x82'": '‚',
    "b'\\x84'": '„',
    "b'\\x91'": '‘',
    "b'\\x93'": '“',
    "b'\\x96'": '-',
    "b'\\xb4'": '´'
}

from pathlib import Path

def doc_strings(path: Path) -> str:
  output_string = ''

  with open(path, 'rb') as stream:
    # Skip the first 2560 bytes (assumed here to be where the text starts)
    stream.seek(2560)
    current_stream = stream.read(1)

    # Stop at the 0xfa marker, or at end of file (read() then returns b'')
    while current_stream and current_stream != b'\xfa':

      if str(current_stream) in special_chars:
        output_string += special_chars[str(current_stream)]

      else:
        try:
          char = current_stream.decode('UTF-8')
          # Keep only letters, digits and spaces; everything else is dropped
          if char.isalnum() or char == ' ':
            output_string += char
        except UnicodeDecodeError:
          pass

      current_stream = stream.read(1)

  return output_string
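To tie this back to the original goal, the extracted string can then be checked against the keyword list. A minimal sketch follows; `find_keywords` is a hypothetical helper, and matching literally and case-insensitively is my own choice for this example rather than something the snippet above requires.

```python
import re

def find_keywords(text: str, keywords: list) -> list:
    """Return the keywords that occur in text, matched literally, ignoring case."""
    hits = []
    for keyword in keywords:
        # re.escape keeps characters such as '.' or '+' from being read as regex syntax
        if re.search(re.escape(keyword), text, flags=re.IGNORECASE):
            hits.append(keyword)
    return hits
```

It would be called as `find_keywords(doc_strings(path), keywords)`. Keep in mind that `doc_strings` drops most punctuation, so keywords containing punctuation are unlikely to match its output.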
