Well, that python-magic
library in the question linked in the comments looks like a pretty straightforward solution.
Nevertheless, I'll give a more manual option. According to this site, DOC files have a signature of D0 CF 11 E0 A1 B1 1A E1
(8 bytes), while DOCX files have 50 4B 03 04
(4 bytes). Both have an offset of 0. A signature is a fixed sequence of bytes, so endianness doesn't matter as long as you read it one byte at a time; I use the little-endian prefix below only for consistency.
You can unpack the binary data using the struct
module like so:
>>> import struct
>>> with open("foo.doc", "rb") as h:
...     buf = h.read()
>>> byte = struct.unpack_from("<B", buf, 0)[0]
>>> print("{0:x}".format(byte))
d0
So, here we unpacked the first little-endian ("<") byte ("B") from a buffer containing the binary data read from the file, at an offset of 0, and we found "D0", the first byte in a DOC file. If we set the offset to 1, we get CF, the second byte.
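As a side note, the same call can pull out all eight bytes at once by putting a repeat count in the format string. This is just a sketch: the buffer here is built by hand from the DOC signature plus some made-up zero padding, so it runs without a real file.

```python
import struct

# Hand-built buffer: the DOC signature followed by made-up zero padding
# (stands in for the contents of a real file).
buf = bytes.fromhex("D0CF11E0A1B11AE1") + b"\x00" * 8

# "8B" unpacks eight unsigned bytes in one call, starting at offset 0.
# Byte order has no effect on single-byte fields; "<" is kept for consistency.
first_eight = struct.unpack_from("<8B", buf, 0)

# Zero-pad each byte to two hex digits so small values such as 0x0A
# would not collapse to a single character.
signature = " ".join("{0:02X}".format(b) for b in first_eight)
print(signature)
```

Here `first_eight` is a tuple of ints, so `first_eight[1]` is the same CF byte you get with an offset of 1.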
Let's check if it is, indeed, a DOC file:
import struct

def is_doc(file):
    with open(file, 'rb') as h:
        buf = h.read(8)  # only the first 8 bytes are needed
    fingerprint = []
    if len(buf) >= 8:
        for i in range(8):
            byte = struct.unpack_from("<B", buf, i)[0]
            # zero-pad to two uppercase hex digits
            fingerprint.append("{0:02X}".format(byte))
    if ' '.join(fingerprint) == "D0 CF 11 E0 A1 B1 1A E1":
        return True
    return False
>>> is_doc("foo.doc")
True
Unfortunately, I don't have any DOCX files to test on, but the process should be the same, except that you only read the first 4 bytes and compare against the other fingerprint.
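For what it's worth, here is a rough sketch of how both checks could be combined by comparing the raw bytes directly instead of building a hex string. The function name `sniff_word_format` and the constant names are my own inventions for illustration, and I have only reasoned about the DOCX branch from the signature quoted above, not tried it on a real DOCX.

```python
# Signatures quoted earlier in this answer; both sit at offset 0.
DOC_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")   # 8 bytes
DOCX_SIGNATURE = bytes.fromhex("504B0304")          # 4 bytes

def sniff_word_format(path):
    """Hypothetical helper: return "doc", "docx", or None from the leading bytes."""
    with open(path, "rb") as h:
        # Read only as many bytes as the longest signature needs.
        head = h.read(len(DOC_SIGNATURE))
    if head.startswith(DOC_SIGNATURE):
        return "doc"
    if head.startswith(DOCX_SIGNATURE):
        return "docx"
    # Too short, or neither signature matched.
    return None
```

Keep in mind that 50 4B 03 04 is the generic ZIP signature (a DOCX is a ZIP container), so any ZIP archive will be reported as "docx" by this check. Likewise, the 8-byte DOC signature is shared with other legacy Office formats such as XLS and PPT. If you need to tell those apart, you would have to look further into the file, or fall back on python-magic.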