In my web app (Flask) I'm letting the user upload a word document.

I check that the extension of the file is either .doc or .docx. However, I changed a .jpg file's extension to .docx and it passed as well (as I expected).

Is there a way to verify that an uploaded file is indeed a word document? I searched and read something about the header of a file but could not find any other information.

I'm using boto to upload the files to aws, in case it matters. Thanks.

Kreutzer

6 Answers

Well, that python-magic library in the question linked in the comments looks like a pretty straight-forward solution.

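Nevertheless, I'll give a more manual option. According to this site, DOC files have a signature of D0 CF 11 E0 A1 B1 1A E1 (8 bytes), while DOCX files have 50 4B 03 04 (4 bytes). Both have an offset of 0. Note that these signatures identify the container rather than Word itself: 50 4B 03 04 is the generic ZIP signature, and D0 CF 11 E0 A1 B1 1A E1 is shared by other legacy Office formats such as XLS and PPT. Byte order is not a concern here, because the signature is compared one byte at a time and reads the same regardless of endianness.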

You can unpack the binary data using the struct module like so:

>>> import struct
>>> with open("foo.doc", "rb") as h:
...    buf = h.read()
>>> byte = struct.unpack_from("<B", buf, 0)[0]
>>> print("{0:x}".format(byte))
d0

So, here we unpacked the first little-endian ("<") byte ("B") from a buffer containing the binary data read from the file, at an offset of 0 and we found "D0", the first byte in a doc file. If we set the offset to 1, we get CF, the second byte.

Let's check if it is, indeed, a DOC file:

import struct

def is_doc(file):
    # Only the first 8 bytes are needed, so don't read the whole file.
    with open(file, 'rb') as h:
        buf = h.read(8)
    fingerprint = []
    if len(buf) >= 8:
        for i in range(8):
            byte = struct.unpack_from("<B", buf, i)[0]
            # Zero-pad so that a byte such as 0x0A renders as "0a", not "a".
            fingerprint.append("{0:02x}".format(byte))
    return ' '.join(fingerprint).upper() == "D0 CF 11 E0 A1 B1 1A E1"

>>> is_doc("foo.doc")
True

Unfortunately I don't have any DOCX files to test on but the process should be the same, except you only get the first 4 bytes and you compare against the other fingerprint.
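
Since `bytes` objects slice and compare the same way on every platform, the same check can also be written without `struct`. Here is a minimal Python 3 sketch, assuming the first bytes of the upload are already in memory. The names `classify_signature`, `OLE2_SIGNATURE` and `ZIP_SIGNATURE` are illustrative and not taken from any library:

```python
# Signatures quoted above; both sit at offset 0.
OLE2_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc container
ZIP_SIGNATURE = bytes.fromhex("504B0304")           # .docx, and any other ZIP


def classify_signature(header):
    """Return "doc", "zip", or None for the leading bytes of a file."""
    if header.startswith(OLE2_SIGNATURE):
        return "doc"
    if header.startswith(ZIP_SIGNATURE):
        return "zip"
    return None
```

For example, `classify_signature(b"PK\x03\x04...")` returns `"zip"`, while the first bytes of a JPEG return `None`. A `"zip"` result only says the upload is some ZIP archive; telling a DOCX apart from other ZIPs requires looking inside the archive.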

  • Note that you may just be able to read `buf[:8]` directly. I profess ignorance as to whether or not that would behave the same on all systems. The `struct.unpack_from` method is guaranteed to work the same. –  Jul 05 '13 at 22:39

Docx files are actually zip files. This zip contains three basic folders: word, docProps and _rels. Thus, use zipfile to test if those files exist in this file.

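import zipfile

def isdir(z, name):
    # True if the archive has at least one entry under the folder `name`.
    prefix = "%s/" % name.rstrip("/")
    return any(x.startswith(prefix) for x in z.namelist())

def isValidDocx(filename):
    # ZipFile raises on a non-ZIP upload (e.g. a renamed .jpg), so check first.
    if not zipfile.is_zipfile(filename):
        return False
    with zipfile.ZipFile(filename, "r") as f:
        return isdir(f, "word") and isdir(f, "docProps") and isdir(f, "_rels")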

Code adapted from Check if a directory exists in a zip file with Python

However, any ZIP that contains those folders will pass the verification. It also will not work for DOC files, which are not ZIP archives, and I don't know whether it works for encrypted documents.
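
One way to narrow this down is to look at the package's declared content types instead of its folder names. This is a minimal sketch, assuming the upload is a regular (unencrypted) DOCX whose `[Content_Types].xml` declares the standard main-document content type. `looks_like_docx` and `DOCX_MAIN` are hypothetical names, and a determined user can still forge such an archive:

```python
import zipfile

# Content type that a .docx package declares for its main document part.
DOCX_MAIN = (b"application/vnd.openxmlformats-officedocument."
             b"wordprocessingml.document.main+xml")


def looks_like_docx(fileobj):
    """Return True if fileobj is a ZIP whose content types mention a Word document."""
    try:
        with zipfile.ZipFile(fileobj) as z:
            content_types = z.read("[Content_Types].xml")
    except (zipfile.BadZipFile, KeyError):
        # Not a ZIP at all, or a ZIP without the content-types part.
        return False
    return DOCX_MAIN in content_types
```

The function accepts a path or a seekable file-like object, so it can be called on an in-memory upload as well as on a file on disk. Unlike the folder check, it rejects a PPTX or XLSX package, because those declare different content types.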

Nacib Neme

You can use the python-docx library

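The code below raises an error if the file is not a docx file: a ValueError if it is a valid package but not a Word document, or (as far as I can tell) a PackageNotFoundError if it is not a ZIP package at all.

from docx import Document
from docx.opc.exceptions import PackageNotFoundError

try:
    Document("abc.docx")
# Catch both cases: wrong package type, and not a package at all (e.g. a renamed .jpg).
except (ValueError, PackageNotFoundError):
    print("Not a valid document type")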
tj89

python-magic does a very good job of detecting docx as well as pptx formats.

Here are a few examples:

In [60]: magic.from_file("oz123.docx")
Out[60]: 'Microsoft Word 2007+'

In [61]: magic.from_file("oz123.docx", mime=True)
Out[61]: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'

In [62]: magic.from_file("presentation.pptx")
Out[62]: 'Microsoft PowerPoint 2007+'

In [63]: magic.from_file("presentation.pptx", mime=True)
Out[63]: 'application/vnd.openxmlformats-officedocument.presentationml.presentation'

Since the OP asked about a file upload, detecting from a file path isn't very useful. Luckily, magic supports detection from a buffer:

In [63]: fdox
Out[63]: <_io.BufferedReader name='/home/oz123/Documents/oz123.docx'>

In [64]: magic.from_buffer(fdox.read(2048))
Out[64]: 'Zip archive data, at least v2.0 to extract'

Naively, we read an amount which is too small ... Reading more bytes solves the problem:

In [65]: fdox.seek(0)
Out[65]: 0

In [66]: magic.from_buffer(fdox.read(4096))
Out[66]: 'Microsoft Word 2007+'

In [67]: fdox.seek(0)
Out[67]: 0

In [68]: magic.from_buffer(fdox.read(4096), mime=True)
Out[68]: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
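
In an upload handler the stream usually has to be read twice: once to sniff the type and once to send the file on to storage. Here is a minimal sketch of peeking at the first bytes and rewinding, assuming the upload object is seekable (as `io.BytesIO` and regular file objects are). `peek` is a hypothetical helper, and its 4096 default mirrors the read size that worked above:

```python
import io


def peek(stream, size=4096):
    """Read up to `size` bytes from the start of `stream`, then restore its position."""
    position = stream.tell()
    stream.seek(0)
    try:
        return stream.read(size)
    finally:
        # Rewind so a later upload step still sees the whole file.
        stream.seek(position)


upload = io.BytesIO(b"PK\x03\x04" + b"\x00" * 10000)  # stand-in for an uploaded file
head = peek(upload)
print(len(head), upload.tell())  # 4096 0
```

The returned bytes can then be handed to `magic.from_buffer(head)`, and the untouched stream passed on to the storage upload.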
oz123

I used python-magic to verify whether a file is a Word document, but I ran into a lot of problems: different Word versions, or different software producing the file, resulted in different detected types. So I gave up on python-magic.

Here is my solution.

import struct

DOC_MAGIC_BYTES = [
    "D0 CF 11 E0 A1 B1 1A E1",
    "0D 44 4F 43",
    "CF 11 E0 A1 B1 1A E1 00",
    "DB A5 2D 00",
    "EC A5 C1 00"
]
DOCX_MAGIC_BYTES = [
    "50 4B 03 04"
]

def validate_is_word(content):
    magic_bytes = content[:8]
    fingerprint = []
    bytes_len = len(magic_bytes)
    if bytes_len >= 4:
        for i in range(bytes_len):
            byte = struct.unpack_from("<B", magic_bytes, i)[0]
            fingerprint.append("{:02x}".format(byte))
    if not fingerprint:
        return False
    if is_docx_file(fingerprint):
        return True
    if is_doc_file(fingerprint):
        return True
    return False


def is_doc_file(magic_bytes):
    four_bytes = " ".join(magic_bytes[:4]).upper()
    all_bytes = " ".join(magic_bytes).upper()
    return four_bytes in DOC_MAGIC_BYTES or all_bytes in DOC_MAGIC_BYTES


def is_docx_file(magic_bytes):
    type_ = " ".join(magic_bytes[:4]).upper()
    return type_ in DOCX_MAGIC_BYTES

You can try this.

chuang wang

I use the filetype Python library to check a file's MIME type and compare it with the file's extension, so my users can't fool me just by changing the extension.

pip install filetype

Then

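import filetype

kind = filetype.guess('path/to/file')
if kind is None:
    # guess() returns None when the type is not recognized.
    print('Cannot guess file type')
else:
    mime = kind.mime
    ext = kind.extension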

You can check their doc here

lucyjosef
  • I think the `filetype` library is meant to check only images, PDF files, and related types (as of 8/19), not Word documents of any kind. I tried it and it returns `None` even for most valid Word documents. I think they'll be adding support soon. – Amith Adiraju Aug 30 '19 at 00:56
  • This library still does not support the docx format. – oz123 Jan 02 '21 at 21:25