5

Possible Duplicate:
Django (audio) File Validation

I'm building a web app, where users are able to upload media content, including audio files.

I've got a clean method in my AudioFileUploadForm that validates the following:

  • That the audio file isn't too big.
  • That the audio file has a valid content_type (MIME type).
  • That the audio file has a valid extension.

However, I'm worried about security. A user could upload a file with malicious code, and easily pass the above validations. What I want to do next is validate that the audio file is, indeed, an audio file (before it writes to disk).

How should I do this?

class UploadAudioForm(forms.ModelForm):
    audio_file = forms.FileField()

    def clean_audio_file(self):
        file = self.cleaned_data.get('audio_file',False):
            if file:
                if file._size > 12*1024*1024:
                    raise ValidationError("Audio file too large ( > 12mb )")
                if not file.content_type in ['audio/mpeg','audio/mp4', 'audio/basic', 'audio/x-midi', 'audio/vorbis', 'audio/x-pn-realaudio', 'audio/vnd.rn-realaudio', 'audio/x-pn-realaudio', 'audio/vnd.rn-realaudio', 'audio/wav', 'audio/x-wav']:
                    raise ValidationError("Sorry, we do not support that audio MIME type. Please try uploading an mp3 file, or other common audio type.")
                if not os.path.splitext(file.name)[1] in ['.mp3', '.au', '.midi', '.ogg', '.ra', '.ram', '.wav']:
                    raise ValidationError("Sorry, your audio file doesn't have a proper extension.")
                # Next, I want to read the file and make sure it is 
                # a valid audio file. How should I do this? Use a library?
                # Read a portion of the file? ...?
                if not ???.is_audio(file.content):
                    raise ValidationError("Not a valid audio file.")
                return file
            else:
                raise ValidationError("Couldn't read uploaded file")

EDIT: By "validate that the audio file is, indeed, an audio file", I mean the following:

A file that contains data typical of an audio file. I'm worried that a user could upload files with appropriate headers, and malicious script in the place of audio data. For example... is the mp3 file an mp3 file? Or does it contain something uncharacteristic of an mp3 file?

Community
  • 1
  • 1
sgarza62
  • 5,998
  • 8
  • 49
  • 69
  • define *valid audio file*? –  Jan 10 '13 at 18:26
  • Use a server side virus scanner –  Jan 10 '13 at 18:27
  • @AamirAdnan Not so. If you read the answer to the question you're referencing, you'll see that the answerer was not aware of which process to use to read and further validate the audio file...thus leaving that part unfinished. In fact, he explicitly said "I don't know what the best library is to do this". My question is, what library or approach should I use to do this. – sgarza62 Jan 10 '13 at 19:02

2 Answers2

5

An alternative to the other posted answer which does header parsing. This means someone could still include other data behind a valid header.

Is to verify the whole file which costs more CPU but also has a stricter policy. A library that can do this is python audiotools and the relevant API method is AudioFile.verify.

Used like this:

import audiotools

f = audiotools.open(filename)
try:
    result = f.verify()
except audiotools.InvalidFile:
    # Invalid file.
    print("Invalid File")
else:
    # Valid file.
    print("Valid File")

A warning is that this verify method is quite strict and can actually flag badly encoded files as invalid. You have to decide yourself if this is a proper method or not for your use case.

Wessie
  • 3,460
  • 2
  • 13
  • 17
  • 1
    +1, coming from the Plone-world, anyway this gave some precious general insights in validating file-types, thanks Wessie! – Ida Mar 27 '13 at 09:33
1

The idea is to use an utility such as sndhdr to try to read a header of the file.

from sndhdr import what
import os
from settings import MEDIA_ROOT
...
if what( os.path.join( MEDIA_ROOT, file.name ) ) == None:
# header parse failed, so, it could be a non-audio file.

In this case you must be sure that the utility could not recognize a non-audio file as an audio file.

I never use the sndhdr utility, so, it may be better to use another one. For example there is also mutagen project.

Update.

The more professional approach.

  1. Let a user upload his files without the such validation.
  2. Create a special model field associated with one file: is_checked. Set the field False by default.
  3. Don't let other users to downoad the files which have is_checked == False
  4. Organize an asynchronous task that runs an anti-malware soft that checked unchecked files. The software must be called as a function from your pythonic task. The function must answer the question: is a particular file has a malware part.
  5. If the current file has the malware part delete it and delete the record about the file. Otherwise set up is_checked = True from your task.

Why a separated task? Because the such malware checking can be a long operation.

sergzach
  • 6,578
  • 7
  • 46
  • 84
  • Referencing the code in the question above, `file.content_type` already reads the header, then returns the content-type listed in the header. But Django documentation says this isn't enough: (https://docs.djangoproject.com/en/dev/topics/http/file-uploads/#django.core.files.uploadedfile.UploadedFile.content_type). Would using sndhdr simply duplicate what's being done with `file.content_type`? – sgarza62 Jan 10 '13 at 18:58
  • 1
    @sgarza62 I think that Django uses the content-type sent by your browser. – sergzach Jan 10 '13 at 19:13
  • Thanks for your answer. This is an interesting approach, however, I've looked at the source code and sndhdr does not recognize mp3, realaudio, among other common audio types. – sgarza62 Jan 10 '13 at 19:28
  • @sgarza62 I have updated my answer. Please look. – sergzach Jan 10 '13 at 19:48
  • Great update. The only problem (for this specific case) is I need the operation to be run before it writes to disk, because our web app renders uploaded content in real-time. I think the information provided in your update will be of help to other users. – sgarza62 Jan 10 '13 at 20:32
  • @sgarza62 Render in real time? How can we play audio before the user request ends? – sergzach Jan 10 '13 at 21:36