
I have an app that lets people upload files, represented as UploadedFiles. However, I want to make sure that users only upload XML files. I know I can check this using magic (python-magic), but I don't know where to put the check: as far as I can tell, I can't put it in the clean function, since the file has not yet been uploaded when clean runs.

Here's the UploadedFile model:

class UploadedFile(models.Model):
    """This represents a file that has been uploaded to the server."""
    STATE_UPLOADED = 0
    STATE_ANNOTATED = 1
    STATE_PROCESSING = 2
    STATE_PROCESSED = 4
    STATES = (
        (STATE_UPLOADED, "Uploaded"),
        (STATE_ANNOTATED, "Annotated"),
        (STATE_PROCESSING, "Processing"),
        (STATE_PROCESSED, "Processed"),
    )

    status = models.SmallIntegerField(choices=STATES,
        default=0, blank=True, null=True) 
    file = models.FileField(upload_to=settings.XML_ROOT)
    project = models.ForeignKey(Project)

    def __unicode__(self):
        return self.file.name

    def name(self):
        return os.path.basename(self.file.name)

    def save(self, *args, **kwargs):
        if not self.status:
            self.status = self.STATE_UPLOADED
        super(UploadedFile, self).save(*args, **kwargs)

    def delete(self, *args, **kwargs):
        os.remove(self.file.path)
        self.file.delete(False)
        super(UploadedFile, self).delete(*args, **kwargs)

    def get_absolute_url(self):
        return u'/upload/projects/%d' % self.id

    def clean(self):
        if not "XML" in magic.from_file(self.file.url):
            raise ValidationError(u'Not an xml file.')

class UploadedFileForm(forms.ModelForm):
    class Meta:                
        model = UploadedFile
        exclude = ('project',)
Plasma
  • http://stackoverflow.com/questions/6460848/in-django-how-does-one-limit-file-types-on-file-uploads-for-modelforms-with-fil – Jingo Nov 28 '13 at 21:42
  • But this is highly ineffective - I could title a file anything I want. – Plasma Nov 29 '13 at 02:48
  • Right, this only checks the extension, but you could use it as a basis for validating the file, e.g. by opening it and checking that it is valid XML. – Jingo Nov 29 '13 at 12:59
  • Is the file uploaded when `clean` runs? – Plasma Nov 29 '13 at 17:45
  • Have a look at Mikko's answer and how he handles the form's cleaned (POST) data, which contains the uploaded file data. – Jingo Nov 29 '13 at 18:34

5 Answers


Validating files is a common challenge, so I would like to use a validator:

import magic

from django.core.exceptions import ValidationError
from django.utils.deconstruct import deconstructible
from django.template.defaultfilters import filesizeformat


@deconstructible
class FileValidator(object):
    error_messages = {
     'max_size': ("Ensure this file size is not greater than %(max_size)s."
                  " Your file size is %(size)s."),
     'min_size': ("Ensure this file size is not less than %(min_size)s. "
                  "Your file size is %(size)s."),
     'content_type': "Files of type %(content_type)s are not supported.",
    }

    def __init__(self, max_size=None, min_size=None, content_types=()):
        self.max_size = max_size
        self.min_size = min_size
        self.content_types = content_types

    def __call__(self, data):
        if self.max_size is not None and data.size > self.max_size:
            params = {
                'max_size': filesizeformat(self.max_size), 
                'size': filesizeformat(data.size),
            }
            raise ValidationError(self.error_messages['max_size'],
                                   'max_size', params)

        if self.min_size is not None and data.size < self.min_size:
            params = {
                'min_size': filesizeformat(self.min_size),
                'size': filesizeformat(data.size)
            }
            raise ValidationError(self.error_messages['min_size'], 
                                   'min_size', params)

        if self.content_types:
            content_type = magic.from_buffer(data.read(), mime=True)
            data.seek(0)

            if content_type not in self.content_types:
                params = { 'content_type': content_type }
                raise ValidationError(self.error_messages['content_type'],
                                   'content_type', params)

    def __eq__(self, other):
        return (
            isinstance(other, FileValidator) and
            self.max_size == other.max_size and
            self.min_size == other.min_size and
            self.content_types == other.content_types
        )

Then you can use FileValidator in your models.FileField or forms.FileField as follows:

validate_file = FileValidator(max_size=1024 * 100, 
                             content_types=('application/xml',))
file = models.FileField(upload_to=settings.XML_ROOT, 
                        validators=[validate_file])
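
As the comments on this answer point out, the file has to be rewound after it is sniffed, and python-magic's documentation recommends passing at least the first 2048 bytes rather than the whole file. Below is a minimal sketch of that peek-and-rewind step using only the standard library. `peek_head` is a hypothetical helper name, not part of Django or python-magic, and the in-memory stream stands in for an uploaded file.

```python
import io


def peek_head(fileobj, size=2048):
    """Return up to `size` leading bytes and restore the read position."""
    position = fileobj.tell()   # remember where the caller was
    fileobj.seek(0)
    head = fileobj.read(size)
    fileobj.seek(position)      # rewind so later readers see the whole file
    return head


# Stand-in for an upload: an XML prolog followed by padding
upload = io.BytesIO(b"<?xml version='1.0'?><root/>" + b" " * 5000)
head = peek_head(upload)
```

Inside `__call__`, something like `magic.from_buffer(peek_head(data), mime=True)` could then replace the full `data.read()`, assuming the uploaded file object supports `tell()` and `seek()`.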
Sultan
  • You should put `data.seek(0)` after `content_type = magic.from_buffer(data.read(), mime=True)` so that a valid file can be read again in the view or file handler without explicitly seeking to 0. – Grijesh Chauhan Mar 09 '15 at 14:24
  • through some googling found this https://pypi.python.org/pypi/django-validated-file/2.0 does what you described above – Clocker Nov 10 '15 at 15:10
  • Why do you add a `@deconstructible` decorator to the class? – Michiel Overtoom Jul 12 '17 at 18:40
  • `django-constrainedfilefield` is a more up to date fork of `django-validated-file`: https://github.com/mbourqui/django-constrainedfilefield It also checks image dimensions with ConstrainedImageField and includes an optional JS validator. – peterhil May 22 '19 at 19:00
  • `deconstruct` lets Django store the validator along with the model field in the migration. From the docs on [validators](https://docs.djangoproject.com/en/2.2/ref/validators/): "If a class-based validator is used in the validators model field option, you should make sure it is serializable by the migration framework by adding deconstruct() and __eq__() methods." – gatlanticus Jul 30 '19 at 23:33
  • Hey, Is this working? It isn't working for me. – Danny Mar 06 '21 at 21:15
  • For those who are facing error, add `return data` at the end of `__call__` method – Debdut Goswami Jun 22 '21 at 11:26

Since Django 1.11, you can also use FileExtensionValidator.

from django.conf import settings
from django.core.validators import FileExtensionValidator
from django.db import models

class UploadedFile(models.Model):
    file = models.FileField(upload_to=settings.XML_ROOT,
        validators=[FileExtensionValidator(allowed_extensions=['xml'])])

Note this must be used on a FileField and won't work on a CharField (for example), since the validator validates on value.name.

ref: https://docs.djangoproject.com/en/dev/ref/validators/#fileextensionvalidator
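
To make the limitation concrete, here is roughly what a name-based check amounts to. This is an illustrative stand-in written with the standard library, not Django's implementation, and `has_allowed_extension` is a hypothetical name.

```python
from pathlib import Path


def has_allowed_extension(filename, allowed=("xml",)):
    """Compare the lowercased suffix of the name against an allow-list."""
    suffix = Path(filename).suffix[1:].lower()  # "Report.XML" -> "xml"
    return suffix in allowed


# Only the name is inspected, so a renamed executable passes the check.
renamed_ok = has_allowed_extension("malware.exe.xml")
```

Because nothing in the file's content is read, a check like this is best combined with a content-based check such as the ones in the other answers.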

rbennell
  • Validating just the file name extension is not sufficient. Please use a validation method that checks the file's content with libmagic. See section 3 on: http://opensourcehacker.com/2013/07/31/secure-user-uploads-and-exploiting-served-user-content/ – peterhil May 22 '19 at 19:23
  • Yeah agreed, I didn't mean to divert away from the other answers, which is why i wrote you can 'also'. maybe i should have put 'in conjunction with the other answers' – rbennell Jun 04 '19 at 15:54

For posterity: the solution is to use the read method and pass that to magic.from_buffer.

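import magic
from django.core.exceptions import ValidationError
from django.forms import ModelForm

from . import models  # assumes this form lives next to models.py

class UploadedFileForm(ModelForm):
    def clean_file(self):
        file = self.cleaned_data.get("file", False)
        if not file:
            return file
        # Sniff only the start of the file, then rewind so it can be saved
        filetype = magic.from_buffer(file.read(2048))
        file.seek(0)
        if "XML" not in filetype:
            raise ValidationError("File is not XML.")
        return file

    class Meta:
        model = models.UploadedFile
        exclude = ('project',)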
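
A magic-based check only looks at the file's signature. As suggested in the comments on the question, the file can also be opened and checked for valid XML. Below is a minimal sketch of such a well-formedness check using the standard library, with an in-memory stream standing in for the uploaded file. `is_well_formed_xml` is a hypothetical helper name. Note that the standard library's XML modules are documented as not secure against maliciously constructed data, so a hardened parser may be preferable for untrusted uploads.

```python
import io
import xml.etree.ElementTree as ET


def is_well_formed_xml(fileobj):
    """Return True if the stream parses as XML; always rewind afterwards."""
    fileobj.seek(0)
    try:
        # iterparse streams the document instead of building a full tree
        for _event, element in ET.iterparse(fileobj):
            element.clear()  # free each element once it has been seen
        return True
    except ET.ParseError:
        return False
    finally:
        fileobj.seek(0)  # leave the file ready for the next reader


good = is_well_formed_xml(io.BytesIO(b"<root><item/></root>"))
bad = is_well_formed_xml(io.BytesIO(b"this is not xml"))
```

A `clean_file` method could call such a helper after the magic check and raise `ValidationError` when it returns False.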
Plasma
  • I believe you need to invert your not in if not. Also, we should use file.read(2048), according to magic's doc: recommend using at least the first 2048 bytes, as less can produce incorrect identification – erickfis Oct 21 '21 at 11:34

I think what you want is to clean the uploaded file in the form's Form.clean_your_field_name_here() method. The data is available on your system by then if it was submitted as a normal HTTP POST request.

Also, if you consider this inefficient, explore the different Django file upload backends and how to do streaming processing.

If you need to consider the security of the system when dealing with uploads:

  • Make sure uploaded file has correct extension

  • Make sure the mimetype matches the file extension
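
The second check can be sketched with the standard library's `mimetypes` module. This assumes the detected type comes from a content sniffer such as `magic.from_buffer(..., mime=True)`. `extension_matches_mime` is a hypothetical helper name, and the extension-to-type mapping can vary between platforms for some formats.

```python
import mimetypes


def extension_matches_mime(filename, detected_mime):
    """True if the type implied by the file name equals the sniffed type."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed is not None and guessed == detected_mime


# A PNG payload uploaded under an .html name should be rejected.
mismatch = extension_matches_mime("page.html", "image/png")
```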

If you are worried about users uploading exploit files (to attack your site):

  • Rewrite all the file contents on save to get rid of possible extra (exploit) payload (so you cannot embed HTML in XML which the browser would interpret as a site-origin HTML file when downloading)

  • Make sure you use content-disposition header on download
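
As an illustration of the last point, the header value that forces a download instead of inline rendering can be built as follows. This is a sketch under the assumption that only the RFC 5987 style `filename*` parameter is sent. `attachment_disposition` is a hypothetical helper name.

```python
from urllib.parse import quote


def attachment_disposition(filename):
    """Build a Content-Disposition value that forces a download."""
    # Keep only the final path component, whichever separator was used
    name = filename.replace("\\", "/").rsplit("/", 1)[-1]
    # Percent-encode everything except unreserved characters
    return "attachment; filename*=UTF-8''" + quote(name, safe="")


header_value = attachment_disposition("../reports/summary 2013.xml")
```

In a Django view the value would be assigned to the response, e.g. `response["Content-Disposition"] = header_value`.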

Some more info here: http://opensourcehacker.com/2013/07/31/secure-user-uploads-and-exploiting-served-user-content/

Below is an example of how I sanitize uploaded images:

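from django import forms
from django.db import models

# filename_gen and utils are the author's own helpers (not shown here)


class Example(models.Model):
    image = models.ImageField(upload_to=filename_gen("participant-images/"), blank=True, null=True)


class ExampleForm(forms.ModelForm):
    class Meta:
        model = Example
        fields = ('image',)

    def clean_image(self):
        """ Clean the uploaded image attachment.
        """
        image = self.cleaned_data.get('image', False)
        utils.ensure_safe_user_image(image)
        return image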


from io import BytesIO

import magic
from PIL import Image
from django.core.exceptions import ValidationError
from django.core.files.uploadedfile import InMemoryUploadedFile


def ensure_safe_user_image(image):
    """ Perform various checks to sanitize user uploaded image data.

    Checks that the image has a valid header, then rewrites its contents
    in memory to drop any extra payload.

    :param: InMemoryUploadedFile instance (Django form field value)

    :raise: ValidationError in the case the image content has issues
    """

    if not image:
        return

    assert isinstance(image, InMemoryUploadedFile), "Image rewrite has been only tested on in-memory upload backend"

    # Make sure the image is not too big, so that PIL does not thrash the server
    if image.size > 4*1024*1024:
        raise ValidationError("Image file too large - the limit is 4 megabytes")

    # Then peek at the header to see what the image claims to be
    image.file.seek(0)
    mime = magic.from_buffer(image.file.getvalue(), mime=True)
    if mime not in ("image/png", "image/jpeg"):
        raise ValidationError("Image is not valid. Please upload a JPEG or PNG image.")

    doc_type = mime.split("/")[-1].upper()

    # Read data from the in-memory file object
    image.file.seek(0)
    pil_image = Image.open(image.file)

    # Rewrite the image contents in the memory
    # (bails out with exception on bad data)
    buf = BytesIO()
    # Image.LANCZOS is the current name of the filter formerly called ANTIALIAS
    pil_image.thumbnail((2048, 2048), Image.LANCZOS)
    pil_image.save(buf, doc_type)
    image.file = buf

    # Make sure the image has valid extension (can't upload .htm image)
    extension = doc_type.lower()
    if not image.name.endswith(".%s" % extension):
        image.name = image.name + "." + extension
Mikko Ohtamaa
  • I seem to be having trouble with this - I already have a `modelForm` for my `UploadedFile` model (see my OP), and I don't know where I can put `clean_file`. If I put it outside of `meta`, I get an error from the javascript I use to upload the file. Inside `meta`, it doesn't seem to execute. – Plasma Nov 29 '13 at 18:35
  • Or I suppose my real problem is that the `file` is still in memory (it's a `InMemoryUploadedFile` object), so I don't see how I can run any checks on the file. – Plasma Dec 01 '13 at 01:42

I recently found an interesting package that validates uploaded files. You can see the package here. Its approach is similar to Sultan's answer, so it can be used right away.

from upload_validator import FileTypeValidator

validator = FileTypeValidator(
    allowed_types=['application/msword'],
    allowed_extensions=['.doc', '.docx']
)

file_resource = open('sample.doc', 'rb')  # binary mode: .doc files are not text

# ValidationError will be raised in case of invalid type or extension
validator(file_resource)
Isfa Hany
  • This will not validate the file. This method just checks the extension of the file. You can rename any file and pass the validation. – ABN Feb 23 '21 at 06:10
  • What is the point of a validator when you are not checking properly. – Smith Aug 30 '22 at 12:06