17

I am trying to loop through a list of files, and return those files that are media files (images, video, gif, audio, etc.).

Seeing as there are a lot of media types, is there a library or perhaps better way to check this, than listing all types then checking a file against that list?

Here's what I'm doing so far:

import os
types = [".mp3", ".mpeg", ".gif", ".jpg", ".jpeg"]
files = ["test.mp3", "test.tmp", "filename.mpg", ".AutoConfig"]

media_files = []
for file in files:
    _, extension = os.path.splitext(file)
    print(extension)
    if extension in types:
        media_files.append(file)

print("Found media files are:")
print(media_files)

But note it didn't include filename.mpg, since I forgot to put .mpg in my types list. (Or, more likely, I didn't expect that list to include a .mpg file, so didn't think to list it out.)

BruceWayne
  • Yes, you can use a MIME type check. Here is an example: [stackoverflow.com](https://stackoverflow.com/questions/43580/how-to-find-the-mime-type-of-a-file-in-python) – Cpp Forever Mar 22 '19 at 21:25
  • If you're running on UNIX/Linux, you can use `file` to determine media type. – tk421 Mar 22 '19 at 21:26
  • @CppForever - I found that, and am studying that library, but am not sure how to check without something like - `if mime.from_file("media.mp3") == "application/mp3" or ...:`? I am missing understanding something I think... – BruceWayne Mar 22 '19 at 21:26
  • You need to use the internet media type. For example, .mp3 becomes audio/mpeg – Cpp Forever Mar 22 '19 at 21:32
  • 1
    @CppForever so do I just check generally "is the file a mime type" without having to check exactly what kind? – BruceWayne Mar 22 '19 at 21:41
  • 1
    After you get the MIME type, for example audio/mp3, you can split it on the / character, take the first part, and check whether it is audio, video, or image – Cpp Forever Mar 22 '19 at 21:43
  • Here are some websites that may help: `https://en.wikipedia.org/wiki/Video_file_format`, `https://www.encoding.com/formats/`, `https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types`, `https://pro.europeana.eu/page/media-formats-mime-types` and `https://www.iana.org/assignments/media-types/media-types.xhtml`. Also, if you don't want to add them manually to your list, you can use the packages that detect the Media MIME types or other libraries as in most of the answers. – The Amateur Coder Oct 01 '21 at 18:56
  • @BruceWayne, there may be some types that aren't listed by the libraries. For example: `.vproj` is VSDC Video Editor's file and is not listed by MIME's media category nor by the libraries, as it isn't registered by the VSDC team, unlike YouTube's `.youtube.yt` and `.yt`, or Adobe's `adobe.flash.movie`, `adobe.xfdf`, and `adobe.photoshop`. Many such file types, even though they are media files, may not be listed in the libraries because they aren't registered. There are a lot more application-specific files that you can manually add to your list. – The Amateur Coder Oct 01 '21 at 19:13
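
The point in the last comment, that application-specific extensions are often unregistered, does not force a return to a fully hand-written list: `mimetypes.add_type` can register extra extensions alongside the built-in ones. A minimal sketch, assuming an invented MIME type `video/x-vsdc-project` for `.vproj` (that name is hypothetical, not a registered type) and a hypothetical helper `top_level_type`:

```python
import mimetypes

# Hypothetical mapping: "video/x-vsdc-project" is an invented name used only
# for illustration; .vproj has no registered media type.
mimetypes.add_type("video/x-vsdc-project", ".vproj")


def top_level_type(filename):
    """Return the part of the guessed MIME type before the slash, or None."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed.split("/")[0] if guessed else None


print(top_level_type("project.vproj"))
```

Any extension registered this way is then picked up by the same audio/video/image check as the built-in types.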

4 Answers

23

For this purpose you need to get the internet media type (MIME type) of the file, split it on the / character, and check whether the first part is audio, video, or image.

Here is some sample code:

import mimetypes
mimetypes.init()

mimestart = mimetypes.guess_type("test.mp3")[0]

if mimestart is not None:
    mimestart = mimestart.split('/')[0]

    if mimestart in ['audio', 'video', 'image']:
        print("media types")

NOTE: This method guesses the file type from the file extension only; it does not open the file or inspect its contents.
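
Applied to the list of files from the question, the same check might look like the sketch below. The helper name `is_media` and the constant `MEDIA_PREFIXES` are made up for this example; only `mimetypes.guess_type` comes from the standard library.

```python
import mimetypes

# Top-level MIME categories treated as "media" in this sketch.
MEDIA_PREFIXES = ("audio", "video", "image")


def is_media(filename):
    """Guess the MIME type from the extension and test its top-level part."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed is not None and guessed.split("/")[0] in MEDIA_PREFIXES


files = ["test.mp3", "test.tmp", "filename.mpg", ".AutoConfig"]
media_files = [f for f in files if is_media(f)]

print("Found media files are:")
print(media_files)
```

Because the lookup is driven by the MIME category rather than a hand-written extension list, `filename.mpg` is picked up without `.mpg` ever being listed.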

Creating a module

If you want to create a module that checks whether a file is a media file, you can call `mimetypes.init()` once at the top of the module. The call is optional: `guess_type` initializes the module itself if that has not happened yet.

Here is an example of how to create the module:

ismediafile.py

import mimetypes
mimetypes.init()

def isMediaFile(fileName):
    mimestart = mimetypes.guess_type(fileName)[0]

    if mimestart is not None:
        mimestart = mimestart.split('/')[0]

        if mimestart in ['audio', 'video', 'image']:
            return True
    
    return False

And here is how to use it:

main.py

from ismediafile import isMediaFile

if __name__ == "__main__":
    if isMediaFile("test.mp3"):
        print("Media file")
    else:
        print("not media file")
Cpp Forever
  • 1
    You should check for `None` (i.e. unknown type) when using [guess_type](https://docs.python.org/3/library/mimetypes.html#mimetypes.guess_type). However, you should note that this method will only check the extension, so it cannot detect the file's actual type. – ekhumoro Mar 22 '19 at 22:47
  • 1
    While `mimetypes` does exactly what the OP asks, maybe also point to https://pypi.org/project/python-libmagic/ which inspects file contents, not just the filename. `libmagic` is the library behind the Unix `file` command. – tripleee Mar 23 '19 at 10:44
  • 1
    Thanks so much for this! I really appreciate both answers too, I like this since I know the filetypes and can check without opening the file. Thanks **so** much! :D – BruceWayne Mar 23 '19 at 18:19
  • 1
    For anyone - [Here's a list of common Web MIME types](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Complete_list_of_MIME_types). And [here's a longer list](https://www.iana.org/assignments/media-types/media-types.xhtml) (but I note it doesn't include `mp3` for some reason). – BruceWayne Mar 23 '19 at 21:52
  • simply brilliant. Should I call mimetypes.init() when my script is imported or wrapped in a function before I guess the mimetype? – Derek Adair Sep 24 '21 at 22:23
  • 1
    Added the answer on how to integrate the code into a module. – Cpp Forever Sep 29 '21 at 16:53
  • @Derek Adair you need to call `mimetypes.init()` at the start of your script before you guess the mimetype, see my edit. – Cpp Forever Oct 02 '21 at 08:32
4

There is another method that is based not on the file extension but on the file contents, using the python-libmagic library (pypi.org/project/python-libmagic).

Here is sample code for this library:

import magic

# Use a separate name so the Magic instance does not shadow the `magic` module
detector = magic.Magic()
mimestart = detector.from_file("test.mp3").split('/')[0]

if mimestart in ['audio', 'video', 'image']:
    print("media types")

NOTE: To use this code sample you need to install python-libmagic using pip.
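
To illustrate what "based on the file contents" means, here is a rough standard-library sketch that compares the first bytes of a file against a handful of well-known signatures. It is not how libmagic is implemented and not a replacement for it: the table is deliberately tiny, and the names `SIGNATURES` and `sniff_media_type` are invented for this example.

```python
import os
import tempfile

# A few well-known file signatures ("magic numbers"); deliberately incomplete.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"ID3", "audio/mpeg"),  # MP3 files that begin with an ID3v2 tag
]


def sniff_media_type(path):
    """Return a MIME type based on the leading bytes, or None if unknown."""
    with open(path, "rb") as fh:
        header = fh.read(16)
    for signature, mime in SIGNATURES:
        if header.startswith(signature):
            return mime
    return None


# Demo: a file with a misleading extension but a PNG signature.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "picture.tmp")
    with open(demo_path, "wb") as fh:
        fh.write(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8)
    print(sniff_media_type(demo_path))
```

The extension `.tmp` plays no role here, which is the reason to inspect contents when file names cannot be trusted.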

Cpp Forever
  • 1
    I assume this method, where it actually checks the file contents itself, is most often used when you can't trust the extension, for whatever reason? Thanks for this! – BruceWayne Mar 23 '19 at 18:27
  • For example in linux the executables don't have extension but have a signature. – Cpp Forever Mar 23 '19 at 18:34
  • 1
    ohhh okay! I am running Windows but using raspberry pi so that will likely come in handy. Thanks again!! – BruceWayne Mar 23 '19 at 18:54
1

Another option would be to leverage FFmpeg, which supports most media formats in existence. This can be especially useful when wanting to know more about the media type of each file.

Using the ffprobe-python package (pip install ffprobe-python):

from ffprobe import FFProbe

# try probing the file with ffmpeg
# if no streams are found, it's not in a format that ffmpeg can read
# -> not considered media file
media_files = [file for file in files if len(FFProbe(file).streams)]

This approach may be considerably slower than just reading the file extensions or MIME types, as it may ingest the complete file. On the other hand, it would be possible to have more information on the type of media that is contained, and the metadata.

Selecting only files containing audio:

has_audio = [file for file in files if len(FFProbe(file).audio)]

Similar for images and videos:

has_img_or_vid = [file for file in files if len(FFProbe(file).video)]

Or collecting the codec names:

codecs = {f: [s.codec_name for s in FFProbe(f).streams] for f in files}
w-m
0

You may list media files as follows:

import os

def lsmedia(mypath):
    img_fm = (".tif", ".tiff", ".jpg", ".jpeg", ".gif", ".png", ".eps", 
          ".raw", ".cr2", ".nef", ".orf", ".sr2", ".bmp", ".ppm", ".heif")
    vid_fm = (".flv", ".avi", ".mp4", ".3gp", ".mov", ".webm", ".ogg", ".qt", ".avchd")
    aud_fm = (".flac", ".mp3", ".wav", ".wma", ".aac")
    media_fms = {"image": img_fm, "video": vid_fm, "audio": aud_fm}

    # str.endswith accepts a tuple of suffixes, so no inner loop is needed
    fns = lambda path, media: [fn for fn in os.listdir(path) if fn.lower().endswith(media_fms[media])]
    img_fns, vid_fns, aud_fns = fns(mypath, "image"), fns(mypath, "video"), fns(mypath, "audio")

    print(f"State of media in '{mypath}'")
    print("Images: ", len(img_fns), " | Videos: ", len(vid_fns), "| Audios: ", len(aud_fns))
    
    return (img_fns, vid_fns, aud_fns)

mypath = "/home/DATA_Lia/data_02/sample" # define dir
(imgs, vids, auds) = lsmedia(mypath)

output:

State of media in '/home/DATA_Lia/data_02/sample'
Images:  24  | Videos:  3 | Audios:  5
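
The hand-maintained tuples above could also be swapped for the standard `mimetypes` lookup while keeping the same image/video/audio grouping. A minimal sketch, assuming a hypothetical helper `group_media` that takes a list of names (for a directory, pass it `os.listdir(mypath)`). Extensions that `mimetypes` does not know, which may include some camera raw formats, are skipped.

```python
import mimetypes
from collections import defaultdict


def group_media(filenames):
    """Group file names by top-level MIME category (image, video, audio)."""
    groups = defaultdict(list)
    for name in filenames:
        guessed, _encoding = mimetypes.guess_type(name)
        if guessed is None:
            continue  # unknown extension: not counted as media
        category = guessed.split("/")[0]
        if category in ("image", "video", "audio"):
            groups[category].append(name)
    return dict(groups)


print(group_media(["a.jpg", "b.mp4", "c.mp3", "notes.txt"]))
```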
San Askaruly