128

How can I tell if a file is binary (non-text) in Python?

I am searching through a large set of files in Python, and keep getting matches in binary files. This makes the output look incredibly messy.

I know I could use grep -I, but I am doing more with the data than what grep allows for.

In the past, I would have just searched for characters greater than 0x7f, but UTF-8 and the like make that impossible on modern systems. Ideally, the solution would be fast.

Martin Thoma
grieve
  • IF "in the past I would have just searched for characters greater than 0x7f" THEN you used to work with plain ASCII text THEN still no issue since ASCII text encoded as UTF-8 remains ASCII (i.e. no bytes > 127). – tzot May 22 '09 at 18:40
  • @ΤΖΩΤΖΙΟΥ: True, but I happen to know that some of the files I am dealing with are utf8. I meant used to in the general sense, not in the specific sense of these files. :) – grieve May 22 '09 at 21:19
  • 1
    Only with probability. You can check whether: 1) the file contains \n; 2) the number of bytes between \n's is relatively small (this is NOT reliable); 3) the file doesn't contain bytes with a value less than that of the ASCII "space" character (' '), EXCEPT "\n", "\r", "\t" and zeroes. – SigTerm Jun 09 '10 at 01:26
  • 3
    The strategy that `grep` itself uses to identify binary files is similar to that posted by Jorge Orpinel [below](http://stackoverflow.com/questions/898669/how-can-i-detect-if-a-file-is-binary-non-text-in-python/3002505#3002505). Unless you set the `-z` option, it will just scan for a null character (`"\000"`) in the file. With `-z`, it scans for `"\200"`. Those interested and/or skeptical can check line 1126 of `grep.c`. Sorry, I couldn't find a webpage with the source code, but of course you can get it from http://gnu.org or via a [distro](http://packages.ubuntu.com/en/lucid/grep). – intuited Oct 13 '10 at 08:18
  • 3
    P.S. As mentioned in the comments thread for Jorge's post, this strategy will give false positives for files containing, for example, UTF-16 text. Nonetheless, both `git diff` and GNU `diff` also use the same strategy. I'm not sure if it's so prevalent because it's so much faster and easier than the alternative, or if it's just because of the relative rarity of UTF-16 files on systems which tend to have these utils installed. – intuited Oct 13 '10 at 08:21
  • Use a library (see my answer below). – guettli Nov 08 '14 at 20:19
  • Use `perl -ne 'print if -B' filename`, see https://stackoverflow.com/questions/29516984/how-to-find-binary-files-in-a-directory. See https://github.com/Perl/perl5/blob/blead/pp_sys.c#L3543 for implementation. – Hans Ginzel Jan 17 '21 at 21:42

21 Answers

78

Yet another method based on file(1) behavior:

>>> textchars = bytearray({7,8,9,10,12,13,27} | set(range(0x20, 0x100)) - {0x7f})
>>> is_binary_string = lambda bytes: bool(bytes.translate(None, textchars))

Example:

>>> is_binary_string(open('/usr/bin/python', 'rb').read(1024))
True
>>> is_binary_string(open('/usr/bin/dh_python3', 'rb').read(1024))
False
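
For use outside the REPL, the same check can be wrapped in a function that closes the file promptly. This is only a minimal sketch: the name `is_binary_file` and the `sample_size` parameter are illustrative assumptions, and the 1024-byte default simply mirrors the example above.

```python
# Bytes that this file(1)-style heuristic treats as "text": a few control
# characters plus everything from 0x20 upward except DEL (0x7f).
TEXTCHARS = bytearray({7, 8, 9, 10, 12, 13, 27} | set(range(0x20, 0x100)) - {0x7f})


def is_binary_file(path, sample_size=1024):
    """Guess whether `path` is binary by inspecting its first bytes.

    `is_binary_file` and `sample_size` are hypothetical names for this sketch.
    """
    with open(path, 'rb') as f:  # the with-block closes the file promptly
        sample = f.read(sample_size)
    # Delete every "text" byte; anything left over is a non-text byte.
    return bool(sample.translate(None, TEXTCHARS))
```

As with the one-liner, UTF-16 and UTF-32 files are reported as binary because they contain zero bytes.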
jfs
  • Can get both false positive and false negatives, but still is a clever approach that works for the large majority of files. +1. – spectras Aug 24 '15 at 14:40
  • 3
    Interestingly enough, file(1) itself excludes 0x7f from consideration as well, so technically speaking you should be using `bytearray([7,8,9,10,12,13,27]) + bytearray(range(0x20, 0x7f)) + bytearray(range(0x80, 0x100))` instead. See [Python, file(1) - Why are the numbers \[7,8,9,10,12,13,27\] and range(0x20, 0x100) used for determining text vs binary file](http://stackoverflow.com/q/32184809) and https://github.com/file/file/blob/b52ef6e698a2098afb32d13ace50a78f0f0f0af4/src/encoding.c#L151-L228 – Martijn Pieters Aug 24 '15 at 15:57
  • 2
    @MartijnPieters: thank you. I've updated the answer to exclude `0x7f` (`DEL`) . – jfs Aug 24 '15 at 16:16
  • 1
    Nice solution using sets. :-) – Martijn Pieters Aug 24 '15 at 16:19
  • Why do you exclude `11` or `VT`? In the table 11 is considered plain ASCII text, and this is the [`vertical tab`](https://en.wikipedia.org/wiki/Tab_key). – darksky Jul 28 '16 at 20:04
  • @darksky : good catch. From the `file(1)` link: *"I exclude vertical tab because it never seems to be used in real text."* This behavior has changed between different `file(1)` versions (perhaps, the link should point to an earlier version). The method is just a heuristic; use whatever works best in your case. – jfs Jul 28 '16 at 20:30
  • Does Python guarantee the file will be immediately closed if you don't use a `with` statement to read those 1024 bytes? – Mark Ransom Jan 04 '18 at 17:19
  • 1
    @MarkRansom to make sure a file is closed, use the `with`-statement or call `.close()` method explicitly. – jfs Jan 04 '18 at 17:21
  • 1
    I only bring it up because you don't do either of those things in this answer. – Mark Ransom Jan 04 '18 at 17:22
  • @MarkRansom it is just a REPL example. I'm sure files that you want to check are not called `/usr/bin/python` literally too. – jfs Jan 04 '18 at 17:29
  • @scott bytes is not str. – jfs Jul 07 '19 at 06:42
  • This method detects a text file as binary if the text file contains a UTF-16 LE BOM. – Murtuza Z Apr 16 '20 at 06:10
  • @MurtuzaZ: It is expected for UTF-16, UTF-32 (they contain zero bytes). – jfs Apr 16 '20 at 15:26
50

You can also use the mimetypes module:

import mimetypes
...
mime, encoding = mimetypes.guess_type(file)  # returns a (type, encoding) tuple

It's fairly easy to compile a list of binary MIME types. For example, Apache ships with a mime.types file that you could parse into two lists, binary and text, and then check whether the guessed MIME type is in your text list or your binary list.
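
A minimal sketch of such a lookup follows. The function name `is_text_by_name` is hypothetical, and treating only `text/*` types as text is an assumption made for brevity; a parsed mime.types list could replace that test. Note that `guess_type` looks only at the file name, never at the contents.

```python
import mimetypes


def is_text_by_name(filename):
    """Classify a file from its name alone (hypothetical helper).

    Returns True for text/* types, False for other known types, and
    None when the extension is unknown and the name cannot decide.
    """
    mime, encoding = mimetypes.guess_type(filename)
    if mime is None:
        return None  # unknown or missing extension
    if encoding is not None:
        return False  # e.g. gzip-compressed data, binary on disk
    return mime.startswith('text/')
```

The three-way result makes the limitation explicit: files without a recognised extension still need a content-based check.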

Nico Schlömer
Gavin M. Roy
  • 25
    Is there a way to get `mimetypes` to use the contents of a file rather than just its name? – intuited Oct 13 '10 at 07:01
  • 7
    @intuited No, but libmagic does that. Use it via [python-magic](https://github.com/ahupp/python-magic). – Bengt Jun 30 '12 at 01:25
  • There is a similar question with some good answers here: http://stackoverflow.com/questions/1446549/how-to-identify-binary-and-text-files-using-python The answer based on an activestate recipe looks good to me, it allows a small proportion of non-printable characters (but no \0, for some reason). – Sam Watkins Mar 14 '13 at 02:57
  • 6
    This isn't a great answer only because the mimetypes module is not good for all files. I'm looking at a file now which system `file` reports as "UTF-8 Unicode text, with very long lines" but mimetypes.guess_type() will return (None, None). Also, Apache's mimetype list is a whitelist/subset. It is by no means a complete list of mimetypes. It cannot be used to classify all files as either text or non-text. – Purrell Feb 26 '15 at 22:21
  • 2
    guess_type is based on the file name extension, not the real content as the Unix command "file" would do. – Eric H. Jun 20 '17 at 07:00
  • \0 (null) auto fails because there should never be a null in a text file. Most text editors see that and that's where the text file is considered to end. – UtahJarhead Sep 04 '18 at 15:27
  • I can confirm, `guess_type` is based on the file extension. Also, in the example code, `file` is actually a string. – RobertG Aug 14 '19 at 19:06
35

If you're using Python 3 with UTF-8, it is straightforward: just open the file in text mode and stop processing if you get a UnicodeDecodeError. Python 3 uses str (Unicode) when handling files in text mode and bytes in binary mode; if your encoding can't decode arbitrary files, it's quite likely that you will get a UnicodeDecodeError.

Example:

try:
    # Decode explicitly as UTF-8 instead of relying on the locale default.
    with open(filename, "r", encoding="utf-8") as f:
        for l in f:
            process_line(l)
except UnicodeDecodeError:
    pass  # Found non-text data
Calaelen
skyking
9

Try this:

def is_binary(filename):
    """Return true if the given filename is binary.
    @raise EnvironmentError: if the file does not exist or cannot be accessed.
    @attention: found @ http://bytes.com/topic/python/answers/21222-determine-file-type-binary-text on 6/08/2010
    @author: Trent Mick <TrentM@ActiveState.com>
    @author: Jorge Orpinel <jorge@orpinel.com>"""
    fin = open(filename, 'rb')
    try:
        CHUNKSIZE = 1024
        while 1:
            chunk = fin.read(CHUNKSIZE)
            if b'\0' in chunk: # found null byte (bytes literal: the file is read in binary mode)
                return True
            if len(chunk) < CHUNKSIZE:
                break # done
    # Look, Python doesn't need the "except:" clause here. Clever.
    finally:
        fin.close()

    return False
Jorge Orpinel
  • 12
    -1 defines "binary" as containing a zero byte. Will classify UTF-16-encoded text files as "binary". – John Machin Jun 09 '10 at 01:34
  • 6
    @John Machin: Interestingly, `git diff` actually [works this way](http://git.kernel.org/?p=git/git.git;a=blob;f=xdiff-interface.c;h=e1e054e4d982de30d8a9c8c4109c6d62448f62a9;hb=HEAD#l240), and sure enough, it detects UTF-16 files as binary. – intuited Oct 13 '10 at 06:57
  • Hunh.. GNU `diff` also works this way. It has similar issues with UTF-16 files. `file` does correctly detect the same files as UTF-16 text. I haven't checked out `grep` 's code, but it too detects UTF-16 files as binary. – intuited Oct 13 '10 at 07:57
  • 1
    +1 @John Machin: utf-16 is a character data according to [`file(1)`](http://linux.die.net/man/1/file) that is not safe to print without conversion so this method is appropriate in this case. – jfs Sep 12 '11 at 18:32
  • 2
    -1 - I don't think 'contains a zero byte' is an adequate test for binary vs text, for example I can create a file containing all 0x01 bytes or repeat 0xDEADBEEF, but it is not a text file. The answer based on file(1) is better. – Sam Watkins Mar 14 '13 at 02:54
8

If it helps, many binary types begin with a magic number. Here is a list of file signatures.
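
As a rough illustration, a signature check might look like the sketch below. The table holds only a handful of widely documented signatures, and the names `SIGNATURES` and `sniff_signature` are hypothetical; a real list would be far longer.

```python
# A few well-known file signatures (illustrative subset, not a complete list).
SIGNATURES = {
    b'\x89PNG\r\n\x1a\n': 'PNG image',
    b'%PDF-': 'PDF document',
    b'PK\x03\x04': 'ZIP archive',
    b'\x1f\x8b': 'gzip data',
    b'\x7fELF': 'ELF executable',
}


def sniff_signature(path):
    """Return a format name if the file starts with a known signature, else None."""
    longest = max(len(sig) for sig in SIGNATURES)
    with open(path, 'rb') as f:
        head = f.read(longest)  # only the first few bytes are needed
    for sig, name in SIGNATURES.items():
        if head.startswith(sig):
            return name
    return None
```

A `None` result only means that no listed signature matched; it does not prove that the file is text.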

Shane C. Mason
  • That is what libmagic is for. It can be accessed in python via [python-magic](https://github.com/ahupp/python-magic). – Bengt Jun 29 '12 at 23:56
  • 4
    Unfortunately, "does not begin with a known magic number" is not equivalent to "is a text file". – Purrell Feb 26 '15 at 22:29
7

Use the binaryornot library (GitHub).

It is very simple and based on the code found in this stackoverflow question.

You can actually write this in 2 lines of code, however this package saves you from having to write and thoroughly test those 2 lines of code with all sorts of weird file types, cross-platform.

kenorb
guettli
7

We can use Python itself to check whether a file is binary, because reading a binary file in text mode usually fails with a UnicodeDecodeError.

def is_binary(file_name):
    try:
        with open(file_name, 'r') as check_file:  # try to open the file in text mode
            check_file.read()
            return False
    except UnicodeDecodeError:  # decoding failed, so the file is non-text (binary)
        return True
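
Reading the whole file at once can be costly for large files. One possible variant, sketched here under the assumption that text files are expected to be UTF-8, decodes the file chunk by chunk with an incremental decoder and stops at the first invalid byte. The name `is_utf8_text` and the `chunk_size` default are illustrative choices.

```python
import codecs


def is_utf8_text(path, chunk_size=8192):
    """Return True if the whole file decodes as UTF-8 (hypothetical helper)."""
    decoder = codecs.getincrementaldecoder('utf-8')()
    try:
        with open(path, 'rb') as f:
            # iter() with a sentinel keeps reading until read() returns b''.
            for chunk in iter(lambda: f.read(chunk_size), b''):
                decoder.decode(chunk)  # raises on the first invalid sequence
            # Flush the decoder; this fails if the file ends mid-character.
            decoder.decode(b'', final=True)
    except UnicodeDecodeError:
        return False
    return True
```

Zero bytes are valid UTF-8, so a file full of NUL bytes still passes this test; combining it with a null-byte check may be worthwhile.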
Caco
Serhii
6

Here's a suggestion that uses the Unix file command:

import re
import subprocess

def istext(path):
    # check_output() returns bytes on Python 3, so decode before matching.
    output = subprocess.check_output(["file", "-L", path]).decode(errors="replace")
    return re.search(r':.* text', output) is not None

Example usage:

>>> istext('/etc/motd') 
True
>>> istext('/vmlinuz') 
False
>>> open('/tmp/japanese').read()
'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf\xe3\x80\x81\xe3\x81\xbf\xe3\x81\x9a\xe3\x81\x8c\xe3\x82\x81\xe5\xba\xa7\xe3\x81\xae\xe6\x99\x82\xe4\xbb\xa3\xe3\x81\xae\xe5\xb9\x95\xe9\x96\x8b\xe3\x81\x91\xe3\x80\x82\n'
>>> istext('/tmp/japanese') # works on UTF-8
True

It has the downsides of not being portable to Windows (unless you have something like the file command there), and having to spawn an external process for each file, which might not be palatable.

Jacob Gabrielson
  • This broke my script :( Investigating, I found out that some conffiles are described by `file` as "Sendmail frozen configuration - version m"—notice the absence of the string "text". Perhaps use `file -i`? – melissa_boiko Jan 22 '16 at 13:32
  • 3
    TypeError: cannot use a string pattern on a bytes-like object – abg Jun 18 '17 at 16:32
5

Try using the currently maintained python-magic, which is not the same module as the one in @Kamil Kisiel's answer. It supports all platforms, including Windows; however, you will need the libmagic binary files. This is explained in the README.

Unlike the mimetypes module, it doesn't use the file's extension and instead inspects the contents of the file.

>>> import magic
>>> magic.from_file("testdata/test.pdf", mime=True)
'application/pdf'
>>> magic.from_file("testdata/test.pdf")
'PDF document, version 1.2'
>>> magic.from_buffer(open("testdata/test.pdf", "rb").read(1024))
'PDF document, version 1.2'
Jossef Harush Kadouri
Eat at Joes
5
from binaryornot.check import is_binary
is_binary('filename')

Documentation

j-tesla
4

A shorter solution, with a UTF-16 warning:

def is_binary(filename):
    """ 
    Return true if the given filename appears to be binary.
    File is considered to be binary if it contains a NULL byte.
    FIXME: This approach incorrectly reports UTF-16 as binary.
    """
    with open(filename, 'rb') as f:
        for block in f:
            if b'\0' in block:
                return True
    return False
Kieee
  • note: `for line in file` may consume unlimited amount of memory until `b'\n'` is found – jfs Apr 29 '14 at 03:01
  • to @Community: `".read()"` returns a bytestring here that *is* iterable (it yields individual bytes). – jfs Apr 29 '14 at 03:02
4

Usually you have to guess.

You can look at the extensions as one clue, if the files have them.

You can also recognise known binary formats, and ignore those.

Otherwise see what proportion of non-printable ASCII bytes you have and take a guess from that.

You can also try decoding from UTF-8 and see if that produces sensible output.
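
The last two clues can be combined into a rough guess, as in the sketch below. The function name `guess_is_text`, the 4096-byte sample, the 30% threshold and the up-front rejection of NUL bytes are all assumptions chosen for illustration, not settled values.

```python
import codecs

# Printable ASCII plus common whitespace/control characters seen in text.
PRINTABLE = bytes(range(0x20, 0x7f)) + b'\n\r\t\f\b'


def guess_is_text(path, sample_size=4096, max_nonprintable=0.30):
    """Guess whether a file is text from a sample of its first bytes."""
    with open(path, 'rb') as f:
        sample = f.read(sample_size)
    if not sample:
        return True  # assumption: an empty file counts as text
    if b'\0' in sample:
        return False  # assumption: NUL bytes mean binary
    # Proportion of bytes that are not printable ASCII.
    nonprintable = sample.translate(None, PRINTABLE)
    if len(nonprintable) / len(sample) <= max_nonprintable:
        return True
    # Mostly non-ASCII: it may still be UTF-8 text. The incremental decoder
    # tolerates a multi-byte character cut off at the end of the sample.
    try:
        codecs.getincrementaldecoder('utf-8')().decode(sample)
    except UnicodeDecodeError:
        return False
    return True
```

Extension checks and known-format detection could be placed before this function as cheaper first filters.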

Douglas Leeder
3

Here's a function that first checks if the file starts with a BOM and if not looks for a zero byte within the initial 8192 bytes:

import codecs


#: BOMs to indicate that a file is a text file even if it contains zero bytes.
_TEXT_BOMS = (
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF8,
)


def is_binary_file(source_path):
    with open(source_path, 'rb') as source_file:
        initial_bytes = source_file.read(8192)
    return not any(initial_bytes.startswith(bom) for bom in _TEXT_BOMS) \
           and b'\0' in initial_bytes

Technically the check for the UTF-8 BOM is unnecessary because UTF-8 text should not contain zero bytes for all practical purposes. But as it is a very common encoding, it's quicker to check for the BOM at the beginning than to scan all 8192 bytes for a zero byte.

roskakori
3

All of these basic methods were incorporated into a Python library: binaryornot. Install with pip.

From the documentation:

>>> from binaryornot.check import is_binary
>>> is_binary('README.rst')
False
RexBarker
3

If you're not on Windows, you can use Python Magic to determine the filetype. Then you can check whether it is a text/* MIME type.

Kamil Kisiel
2

Most programs consider a file to be binary (that is, any file that is not "line-oriented") if it contains a NUL character.

Here is perl's version of pp_fttext() (pp_sys.c) implemented in Python:

import sys
PY3 = sys.version_info[0] == 3

# A function that takes an integer in the 8-bit range and returns
# a single-character byte object in py3 / a single-character string
# in py2.
#
int2byte = (lambda x: bytes((x,))) if PY3 else chr

_text_characters = (
        b''.join(int2byte(i) for i in range(32, 127)) +
        b'\n\r\t\f\b')

def istextfile(fileobj, blocksize=512):
    """ Uses heuristics to guess whether the given file is text or binary,
        by reading a single block of bytes from the file.
        If more than 30% of the chars in the block are non-text, or there
        are NUL ('\x00') bytes in the block, assume this is a binary file.
    """
    block = fileobj.read(blocksize)
    if b'\x00' in block:
        # Files with null bytes are binary
        return False
    elif not block:
        # An empty file is considered a valid text file
        return True

    # Use translate's 'deletechars' argument to efficiently remove all
    # occurrences of _text_characters from the block
    nontext = block.translate(None, _text_characters)
    return float(len(nontext)) / len(block) <= 0.30

Note also that this code was written to run on both Python 2 and Python 3 without changes.

Source: Perl's "guess if file is text or binary" implemented in Python

umläute
kenorb
1

I guess that the best solution is to use the guess_type function. It holds a list with several mimetypes and you can also include your own types. Here is the script that I wrote to solve my problem:

from mimetypes import add_type, guess_type
from os import listdir


# The class name is illustrative; the original snippet omitted the class statement.
class TextFileFinder(object):
    def __init__(self):
        self.__addMimeTypes()

    def __addMimeTypes(self):
        # Register extra extensions that should count as text.
        add_type("text/plain", ".properties")

    def __listDir(self, path):
        try:
            return listdir(path)
        except OSError:
            print("The directory {0} could not be accessed".format(path))
            return []

    def getTextFiles(self, path):
        textFiles = []
        for name in self.__listDir(path):
            # guess_type() returns (None, None) for unknown extensions.
            mime = guess_type(name)[0]
            if mime is not None and mime.split("/")[0] == "text":
                textFiles.append(name)
        return textFiles

It is inside a class, as you can see from the structure of the code, but you can change whatever you need to fit it into your application. It's quite simple to use. The method getTextFiles returns a list of all the text files that reside in the directory you pass in the path variable.

kenorb
Leonardo
1

on *NIX:

If you have access to the file shell-command, shlex can help make the subprocess module more usable:

from os.path import realpath
from subprocess import check_output
from shlex import quote, split

filepath = realpath('rel/or/abs/path/to/file')
# quote() protects paths containing spaces; lower-case the output, not the command.
assert 'ascii' in check_output(split('file {}'.format(quote(filepath)))).decode().lower()

Or, you could also stick that in a for-loop to get output for all files in the current dir using:

import os
for afile in [x for x in os.listdir('.') if os.path.isfile(x)]:
    assert 'ascii' in check_output(split('file {}'.format(quote(afile)))).decode().lower()

or for all subdirs:

# os.walk() yields (directory, subdirectories, files) tuples.
for curdir, subdirs, filelist in os.walk('.'):
    for afile in filelist:
        path = os.path.join(curdir, afile)
        assert 'ascii' in check_output(split('file {}'.format(quote(path)))).decode().lower()
Rob Truxal
1

I came here looking for exactly the same thing: a comprehensive solution provided by the standard library to detect binary or text. After reviewing the options people suggested, the *nix file command looks to be the best choice (I'm only developing for Linux boxen). Some others posted solutions using file, but they are unnecessarily complicated in my opinion, so here's what I came up with:

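import subprocess

def test_file_isbinary(filename):
    # Pass the arguments as a list so file names with spaces or quotes work.
    cmd = ["file", "-b", "-e", "soft", filename]
    # check_output() returns bytes on Python 3, hence the bytes literals.
    if subprocess.check_output(cmd)[:4] in {b'ASCI', b'UTF-'}:
        return False
    return True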

It should go without saying, but the code that calls this function should make sure the file is readable before testing it; otherwise the file will be mistakenly detected as binary.

rsaw
0

A simpler way is to check whether the file contains a NULL character (\x00) by using the in operator, for instance:

b'\x00' in open("foo.bar", 'rb').read()

See below the complete example:

#!/usr/bin/env python3
import argparse
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('file', nargs=1)
    args = parser.parse_args()
    with open(args.file[0], 'rb') as f:
        if b'\x00' in f.read():
            print('The file is binary!')
        else:
            print('The file is not binary!')

Sample usage:

$ ./is_binary.py /etc/hosts
The file is not binary!
$ ./is_binary.py `which which`
The file is binary!
kenorb
0

Are you on Unix? If so, then try:

isBinary = os.system("file -b " + name + " | grep text > /dev/null")

The shell return values are inverted (0 is ok, so if it finds "text" then it will return a 0, and in Python that is a False expression).

fortran
  • For reference, the file command guesses a type based on the file's content. I'm not sure whether it pays any attention to the file extension. – David Z May 22 '09 at 17:23
  • I'm almost sure it looks both in the content and the extension. – fortran May 22 '09 at 18:50
  • This breaks if the path contains "text", tho. Make sure to rsplit at the last ':' (provided there's no colon in the file type description). – Alan Plum Oct 28 '09 at 18:16
  • 3
    Use `file` with the `-b` switch; it'll print only the file type without the path. – dubek Dec 23 '09 at 16:18
  • 2
    a slightly nicer version: `is_binary_file = lambda filename: "text" in subprocess.check_output(["file", "-b", filename])` – jfs Sep 12 '11 at 19:08