112

I have a folder full of files and they don't have an extension. How can I check file types? I want to check the file type and change the filename accordingly. Let's assume a function filetype(x) returns a file type like png. I want to do this:

files = os.listdir(".")
for f in files:
    os.rename(f, f + "." + filetype(f))

How do I do this?

martineau
emnoor
  • 3
    rel: http://stackoverflow.com/questions/43580/how-to-find-the-mime-type-of-a-file-in-python – georg Jun 07 '12 at 18:11
  • You'll have to be more specific with regard to `file types`. Do you mean determining if it's a gif, png, bmp or jpg? Do you just want to know if it's text/binary? Executable? – JoeFish Jun 07 '12 at 18:12
  • @thg435, once you have the MIME type is there a way to convert that to a suitable filename extension? – Mark Ransom Jun 07 '12 at 18:14
  • @Mark: yes, use [guess_extension](http://docs.python.org/library/mimetypes.html#mimetypes.guess_extension), but actually, mimetypes won't work here, because it's based on file extensions. What they need is libmagic (see the 2nd answer on the link). – georg Jun 07 '12 at 18:18
  • @thg435, it's not very robust - `application/jpeg` returns `.jpe` rather than the preferred `.jpg`. It really does appear to be guessing. – Mark Ransom Jun 07 '12 at 18:26
  • @Mark: no, it doesn't guess, it takes the info straight from the local mime database (/etc/mime.types or whatever). `jpe` just happens to be the first match for image/jpeg; try `guess_all_extensions` to see them all. – georg Jun 07 '12 at 18:33
  • @JoeFish determining if it's a gif, png, pdf, or jpg or something else – emnoor Jun 07 '12 at 19:36
  • 2
    try this https://pypi.org/project/filetype/ ? – zx1986 Jan 18 '19 at 02:20
  • Voting to reopen. This question is asking about determining the type of files without extensions, whereas the linked question, https://stackoverflow.com/q/43580/3216427, is about mime types, which are determined by looking at the extension. That's precisely what OP says they don't have. – joanis Nov 09 '21 at 20:11

10 Answers

122

There are Python libraries that can recognize files based on their content (usually a header / magic number) and that don't rely on the file name or extension.

If you're dealing with many different file types, you can use python-magic, a Python binding for the well-established libmagic library. It has a good reputation and, in the limited use I've made of it, it has been solid.

There are also libraries for more specialized file types. For example, the Python standard library has the imghdr module, which does the same thing just for image file types. Note that imghdr is deprecated since Python 3.11 and was removed in Python 3.13.

If you need dependency-free (pure Python) file type checking, see filetype.
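To tie this back to the loop in the question, here is a minimal sketch of the renaming step. The names `add_extensions` and `detect_with_filetype` are made up for illustration; the detector is passed in as a plain callable, so any of the libraries above can be plugged in. The `filetype`-based detector assumes that package is installed and relies on `filetype.guess()` returning an object with an `extension` attribute, or `None` for unknown content.

```python
import os


def add_extensions(directory, detect):
    """Append a detected extension to every extensionless file in `directory`.

    `detect` is any callable taking a path and returning an extension
    without the dot (e.g. 'png'), or None when the type is unknown.
    Returns a list of (old_name, new_name) pairs.
    """
    renamed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip directories and files that already have an extension.
        if not os.path.isfile(path) or os.path.splitext(name)[1]:
            continue
        ext = detect(path)
        if ext is None:
            continue  # unknown type: leave the file alone
        new_name = name + "." + ext
        new_path = os.path.join(directory, new_name)
        if os.path.exists(new_path):
            continue  # never overwrite an existing file
        os.rename(path, new_path)
        renamed.append((name, new_name))
    return renamed


def detect_with_filetype(path):
    """Detector backed by the third-party `filetype` package (pip install filetype)."""
    import filetype  # imported lazily so the rest of the sketch needs only the stdlib

    kind = filetype.guess(path)
    return kind.extension if kind is not None else None
```

With `filetype` installed, `add_extensions(".", detect_with_filetype)` would rename the files in the current directory. It is probably wise to try it on a copy of the folder first.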

phoenix
Chris Johnson
  • 3
    The package [`python-magic-win64`](https://github.com/axnsan12/python-magic-win64) worked for me in Windows – ChesuCR Jan 25 '19 at 09:52
  • 2
    [imghdr](https://docs.python.org/3/library/imghdr.html) with combination of [filetype](https://pypi.org/project/filetype/) worked for me in windows – hru_d Oct 23 '19 at 20:16
  • 1
    The [imghdr](https://docs.python.org/3/library/imghdr.html#module-imghdr) module is deprecated since version 3.11 – AlexElizard May 19 '22 at 06:56
  • 1
    For python magic there are dependencies: Windows: `pip install python-magic-bin` and Linux: `sudo apt-get install libmagic1` are required. – Matt Oct 10 '22 at 14:21
  • 2
    Also note that **filetype** and **python-magic** can produce quite different results, e.g. if you have a `*.docx` file, then filetype reports `"application/zip"` as MIME type (because essentially a docx file consists of a zip container containing XML documents) while python-magic says `"application/vnd.openxmlformats-officedocument.wordprocessingml.document"`, which is more precise (i.e. it looks inside the ZIP container). – Matt Oct 10 '22 at 14:30
  • The latest release of *filetype* module could correctly guess `.docx` file as `application/vnd.openxmlformats-officedocument.wordprocessingml.document` – oeter Mar 31 '23 at 16:44
76

The Python Magic library provides the functionality you need.

You can install the library with pip install python-magic and use it as follows:

>>> import magic

>>> magic.from_file('iceland.jpg')
'JPEG image data, JFIF standard 1.01'

>>> magic.from_file('iceland.jpg', mime=True)
'image/jpeg'

>>> magic.from_file('greenland.png')
'PNG image data, 600 x 1000, 8-bit colormap, non-interlaced'

>>> magic.from_file('greenland.png', mime=True)
'image/png'

In this case the Python code calls libmagic under the hood, which is the same library used by the *NIX file command. Thus, this does the same thing as the subprocess/shell-based answers, but without the overhead of starting a separate process.
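Once you have a MIME type such as `'image/jpeg'`, you still need a filename extension. One possible way, sketched here with only the standard library, is `mimetypes.guess_extension`, with a small override table for types where the first database match is not the extension you want. The helper name `extension_for_mime` and the contents of `PREFERRED_EXTENSIONS` are assumptions for illustration, not part of any library.

```python
import mimetypes

# Hand-picked preferences for types where the first database match is unusual.
# These entries are illustrative assumptions, not part of any library.
PREFERRED_EXTENSIONS = {
    "image/jpeg": ".jpg",
    "text/plain": ".txt",
}


def extension_for_mime(mime_type):
    """Return an extension such as '.png' for a MIME type, or '' if none is known."""
    if mime_type in PREFERRED_EXTENSIONS:
        return PREFERRED_EXTENSIONS[mime_type]
    # guess_extension returns None for types missing from the local database.
    return mimetypes.guess_extension(mime_type) or ""
```

Combined with python-magic, the rename could then look like `os.rename(f, f + extension_for_mime(magic.from_file(f, mime=True)))`.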

Richard
  • 7
    Beware that the debian/ubuntu package called python-magic is different to the pip package of the same name. Both are `import magic` but have incompatible contents. See http://stackoverflow.com/a/16203777/3189 for more. – Hamish Downer Apr 28 '15 at 11:13
  • 1
    @Richard Do you mind elaborating on the overhead aspect? What makes the `python-magic` library more efficient than using subprocess approaches? – Greg Mar 29 '17 at 15:10
  • 2
    Superb answer. If you see `failed to find libmagic. Check your installation`, then run `brew install libmagic` and try it again – stevec Feb 07 '21 at 13:56
11

On Unix and Linux there is the file command to guess file types. There's even a Windows port.

From the man page:

File tests each argument in an attempt to classify it. There are three sets of tests, performed in this order: filesystem tests, magic number tests, and language tests. The first test that succeeds causes the file type to be printed.

You would need to run the file command with the subprocess module and then parse the results to figure out an extension.
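A rough sketch of that approach, assuming the `file` binary is on the PATH: the `--brief` flag drops the leading `name:` prefix and `--mime-type` prints only the MIME type, so there is very little left to parse. The function names are made up for illustration.

```python
import subprocess


def parse_file_output(raw):
    """Turn the raw bytes printed by `file --brief --mime-type` into a str like 'image/png'."""
    return raw.decode("utf-8", errors="replace").strip()


def mime_type_via_file(path):
    """Ask the external `file` command for the MIME type of `path`."""
    # Passing a list avoids the shell, so odd file names need no quoting.
    raw = subprocess.check_output(["file", "--brief", "--mime-type", path])
    return parse_file_output(raw)
```

Mapping the resulting MIME type to an extension is a separate step and still has to be done by hand or with a lookup table.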

edit: Ignore my answer. Use Chris Johnson's answer instead.

Community
Steven Rumbalski
  • +1 I didn't realize `file` did that much. `# file arc.gif arc.gif: GIF image data, version 89a, 234 x 269` – JoeFish Jun 07 '12 at 18:20
  • Well, I was hoping someone had a better answer. There's still a lot of work for the OP, it's not a simple function call. – Steven Rumbalski Jun 07 '12 at 18:22
  • 3
    +1 One benefit with using the `file` command is that it is native on (most?) Linux distributions while the `python-magic` is not and has to be downloaded and installed before it can be used. This is somewhat of a problem if the script using the module is supposed to be portable. – HelloGoodbye Jan 22 '14 at 18:52
10

In the case of images, you can use the imghdr module.

>>> import imghdr
>>> imghdr.what('8e5d7e9d873e2a9db0e31f9dfc11cf47')  # You can pass a file name or a file object as first param. See doc for optional 2nd param.
'png'

Python 2 imghdr doc
Python 3 imghdr doc

phoenix
Lewis Diamond
  • imghdr is deprecated in python 3.11 https://docs.python.org/3/library/imghdr.html - `filetype` looks to work well in place of it https://pypi.org/project/filetype/ – Lucas Walter Dec 19 '22 at 15:44
7
import subprocess as sub
# Pass the command as a list; a single string only works with shell=True on POSIX.
p = sub.Popen(['file', 'yourfile.txt'], stdout=sub.PIPE, stderr=sub.PIPE)
output, errors = p.communicate()
print(output)

As Steven pointed out, subprocess is the way to go. The code above captures the command's output, as described in this post.

vaeVictis
xvatar
6

You can also install the official file binding for Python, a library called file-magic (it does not use ctypes, like python-magic).

It's available on PyPI as file-magic and on Debian as python-magic. For me this library is the best to use since it's available on PyPI and on Debian (and probably other distributions), making the process of deploying your software easier. I've blogged about how to use it, also.

Álvaro Justen
4

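With the newer subprocess library, you can now use the following code (*nix-only solution):

import subprocess
import shlex

filename = 'your_file'
# Quote the file name so that spaces or shell metacharacters survive shlex.split.
cmd = shlex.split('file --mime-type {0}'.format(shlex.quote(filename)))
# check_output returns bytes in Python 3, so decode before splitting.
result = subprocess.check_output(cmd).decode()
# The MIME type is the last whitespace-separated field of the output.
mime_type = result.split()[-1]
print(mime_type)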
berniey
  • Thanks for the answer. BTW, you should not use str.split() on a command line. Use shlex.split(cmd) instead. – emnoor Jun 06 '14 at 12:14
  • 1
    Instead of using `shlex.split`, why not just run `subprocess.check_output(['file', '--mime-type', filename])`? – Flimm Aug 03 '16 at 07:05
2

You can also use this code (pure Python, checking the first 3 bytes of the file header):

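import os

def guess_mime_type(media_root, pathfile):
    """Guess a MIME type from the first 3 bytes of the file."""
    full_path = os.path.join(media_root, pathfile)

    try:
        # Only the first 3 bytes are needed for the signatures below.
        with open(full_path, "rb") as f:
            header = f.read(3)
    except IOError:
        return "Incorrect Request :( !!!"

    # bytes.hex() replaces the Python 2-only str.encode("hex") and is already lowercase.
    header_byte = header.hex()

    if header_byte == '474946':
        return "image/gif"
    elif header_byte == '89504e':
        return "image/png"
    elif header_byte == 'ffd8ff':
        return "image/jpeg"
    else:
        return "binary file"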

This works without installing any package.
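The same idea can be written as a lookup table, which makes it easier to add formats such as PDF or ZIP. This is only a sketch: the signature list is a small illustrative subset taken from commonly documented magic numbers, and `sniff_extension` is a made-up name.

```python
# Signature table: (leading bytes, extension). A small illustrative subset.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"\xff\xd8\xff", "jpg"),
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),  # also matches xlsx/docx, which are ZIP containers
]

# Read only as many bytes as the longest signature needs.
LONGEST = max(len(signature) for signature, _ in SIGNATURES)


def sniff_extension(path):
    """Return an extension guessed from the file's leading bytes, or None."""
    with open(path, "rb") as f:
        header = f.read(LONGEST)
    for signature, extension in SIGNATURES:
        if header.startswith(signature):
            return extension
    return None
```

A table like this cannot tell an xlsx file from any other ZIP archive, since that requires looking inside the container.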

evergreen
  • How can I check for xlsx? – Harsha Biyani May 14 '20 at 10:34
  • You can check 4 or 8 bytes. XLSX (MS Office Open XML Format Document) => 50 4B 03 04 (4 bytes) => ASCII (PK••) ***or*** XLSX (MS Office 2007 documents) => 50 4B 03 04 14 00 06 00 (8 bytes) => ASCII (PK••••••) – evergreen May 14 '20 at 16:29
  • this comment gives a very interesting idea, for me, it's pythonic although it is not accurate. It should be as Alya Mad's question https://stackoverflow.com/questions/69561458/how-to-check-type-of-files-using-the-header-file-signature-magic-numbers and more extension https://en.wikipedia.org/wiki/List_of_file_signatures – lam vu Nguyen Mar 14 '23 at 07:12
0

This only works on Linux, but with the "sh" Python module you can call any shell command directly:

https://pypi.org/project/sh/

pip install sh

import sh

sh.file("/root/file")

Output: /root/file: ASCII text

Lelouch
0

This code recursively lists all files in a given folder whose detected MIME type contains a given string (here, sqlite):

import magic
import glob
from os.path import isfile

ROOT_DIR = 'backup'
WANTED_EXTENSION = 'sqlite'

for filename in glob.iglob(ROOT_DIR + '/**', recursive=True):
    if isfile(filename):
        # from_file(..., mime=True) returns a MIME type, not a file extension.
        mime_type = magic.from_file(filename, mime=True)
        if WANTED_EXTENSION in mime_type:
            print(filename)

https://gist.github.com/izmcm/6a5d6fa8d4ec65fd9851a1c06c8946ac