How to identify binary and text files using Python?

Question

I need to identify which files in a directory are binary and which are text.

I tried using mimetypes, but it isn't a good fit in my case because it can't identify the MIME type of every file, and I have some strange ones here... I just need to know: binary or text. Simple? But I couldn't find a solution...

Thanks

What is a text file for you? Does UTF-16-BE encoded Unicode count, for example? — , Sep 18 '09 at 20:04
You need to define precisely what is meant by 'binary' and 'text' before anyone can help you. — Grzegorz Oledzki, Sep 18 '09 at 20:07
A text file is any file that is readable by humans. Say, any file that you can read with the "cat" (Linux) or "type" (Windows) command. — Thomas, Sep 19 '09 at 14:07
This similar question has a few good answers: http://stackoverflow.com/questions/898669/how-can-i-detect-if-a-file-is-binary-non-text-in-python. file(1) is pretty reliable, so you could go with the pure-Python solution that is based on file(1) behaviour, or you could trust the mimetypes module. — Sam Watkins, Mar 14 '13 at 02:52
Use this library: https://pypi.python.org/pypi/binaryornot/ It is very simple and based on code found in this stackoverflow question. — guettli, Nov 07 '14 at 09:10

score 11 · Accepted Answer · edited Sep 23 '14 at 09:12

Thanks everybody, I found a solution that suited my problem. I found this code at http://code.activestate.com/recipes/173220/ and I changed just a little piece to suit me.

It works fine.

from __future__ import division
import string 

def istext(filename):
    s=open(filename).read(512)
    text_characters = "".join(map(chr, range(32, 127)) + list("\n\r\t\b"))
    _null_trans = string.maketrans("", "")
    if not s:
        # Empty files are considered text
        return True
    if "\0" in s:
        # Files with null bytes are likely binary
        return False
    # Get the non-text characters (maps a character to itself then
    # use the 'remove' option to get rid of the text characters.)
    t = s.translate(_null_trans, text_characters)
    # If more than 30% non-text characters, then
    # this is considered a binary file
    if float(len(t))/float(len(s)) > 0.30:
        return False
    return True

answered Sep 18 '09 at 21:15 by Thomas · edited Sep 23 '14 at 09:12 by Copas

A small correction for your code: `if float(len(t))/float(len(s)) > 0.30: return 0` Otherwise, Python will use integer division, and the comparison will only be true when len(t) == len(s) – Cédric Julien Oct 01 '11 at 09:34
Thomas, please apply that "float" correction to the answer! Activestate should fix their recipe, too! ;) but I can't be bothered signing up to bump the comments there. – Sam Watkins Mar 14 '13 at 02:59
also there is a trailing * on the last line, should not be there – Sam Watkins Mar 14 '13 at 05:33
@cedriv-julien, @sam-watkins, I think it's fine without the use of `float`, because of the `from __future__ import division` line, isn't it? – simon Apr 27 '14 at 18:53
Yeah, he changed that recently. I would prefer not to change the meaning of division throughout my program. And the original code at activestate lacks that line. – Sam Watkins May 11 '14 at 14:38
TypeError: unsupported operand type(s) for +: 'map' and 'list' – abg Jun 18 '17 at 16:40
This code is not valid for python 3 – Alg_D Mar 19 '19 at 23:31
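
For readers on Python 3, here is a minimal sketch of the same heuristic, not the answerer's own code. It assumes the file is read in binary mode and uses `bytes.translate` with a delete argument in place of `string.maketrans`, which avoids both errors reported in the comments above. The `blocksize` and `threshold` parameters and the `TEXT_BYTES` name are additions for illustration.

```python
# Bytes treated as "text": printable ASCII plus newline, carriage return,
# tab and backspace (the same set as the original recipe).
TEXT_BYTES = bytes(range(32, 127)) + b"\n\r\t\b"


def istext(filename, blocksize=512, threshold=0.30):
    """Guess whether a file is text by sampling its first block."""
    with open(filename, "rb") as f:
        s = f.read(blocksize)
    if not s:
        # Empty files are considered text
        return True
    if b"\0" in s:
        # Files with null bytes are likely binary
        return False
    # Delete every text byte; what remains are the non-text bytes
    t = s.translate(None, TEXT_BYTES)
    # More than 30% non-text bytes: treat the file as binary
    return len(t) / len(s) <= threshold
```

Like the original, this only recognises ASCII-range bytes as text, so a UTF-8 file written mostly in a non-Latin script would probably be classified as binary.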

score 8 · Answer 2 · answered Sep 18 '09 at 20:07

It's inherently not simple. There's no way of knowing for sure, although you can take a reasonably good guess in most cases.

Things you might like to do:

Look for known magic numbers in binary signatures
Look for the Unicode byte-order-mark at the start of the file
If the file is regularly 00 xx 00 xx 00 xx (for arbitrary xx) or vice versa, that's quite possibly UTF-16
Otherwise, look for 0s in the file; a file with a 0 byte in it is unlikely to be a single-byte-encoding text file.

But it's all heuristic - it's quite possible to have a file which is a valid text file and a valid image file, for example. It would probably be nonsense as a text file, but legitimate in some encoding or other...
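
The four checks listed above could be combined roughly as follows. This is only a sketch: the names `MAGIC_NUMBERS`, `BOMS`, `looks_like_utf16` and `guess_is_text` are made up for illustration, the signature list is a tiny sample rather than a complete database, and the UTF-16 test assumes text whose characters all fit in one byte (for example ASCII encoded as UTF-16).

```python
import codecs

# A few well-known binary signatures (illustrative, far from complete)
MAGIC_NUMBERS = (
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"\xff\xd8\xff",       # JPEG
    b"GIF87a", b"GIF89a",  # GIF
    b"PK\x03\x04",         # ZIP (also .docx, .jar, ...)
    b"\x7fELF",            # ELF executable
)

BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE,
    codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE,
)


def looks_like_utf16(sample):
    """True if one byte lane is all zeros and the other contains none,
    i.e. the 00 xx 00 xx ... or xx 00 xx 00 ... pattern."""
    if len(sample) < 4:
        return False
    n = len(sample) - len(sample) % 2
    even, odd = sample[0:n:2], sample[1:n:2]
    return (not any(even) and all(odd)) or (not any(odd) and all(even))


def guess_is_text(sample):
    """Heuristic guess from the first bytes of a file."""
    if not sample:
        return True
    if sample.startswith(MAGIC_NUMBERS):
        return False   # known binary signature
    if sample.startswith(BOMS):
        return True    # Unicode byte-order mark
    if looks_like_utf16(sample):
        return True    # probably UTF-16 without a BOM
    return b"\x00" not in sample  # null bytes suggest binary
```

A caller would pass in the first few hundred bytes of a file opened in binary mode, for example `guess_is_text(open(path, "rb").read(512))`.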

score 7 · Answer 3 · answered Sep 18 '09 at 21:38

It might be possible to use libmagic to guess the MIME type of the file using python-magic. If you get back something in the "text/*" namespace, it is likely a text file, while anything else is likely a binary file.
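
A rough sketch of that idea, assuming the `python-magic` package from PyPI (which exposes `magic.from_file(path, mime=True)`) and an installed libmagic. Other bindings that are also imported as `magic` have a different API. The helper names `mime_is_text` and `is_text_file` are hypothetical.

```python
def mime_is_text(mime):
    """True if a MIME string such as 'text/plain; charset=us-ascii'
    falls in the text/* namespace."""
    return mime.split(";")[0].strip().lower().startswith("text/")


def is_text_file(path):
    """Classify a file via libmagic (assumes python-magic is installed)."""
    import magic  # third-party: python-magic, requires libmagic
    return mime_is_text(magic.from_file(path, mime=True))
```

Keeping the string check separate from the libmagic call makes it easy to reuse with MIME types obtained some other way.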

answered Sep 18 '09 at 21:38 by John Paulett

score 5 · Answer 4 · answered Sep 18 '09 at 20:05

If your script is running on *nix, you could use something like this:

import subprocess
import re

def is_text(fn):
    msg = subprocess.Popen(["file", fn], stdout=subprocess.PIPE).communicate()[0]
    return re.search('text', msg) != None

answered Sep 18 '09 at 20:05 by Aoife

No need for `re` if just finding substring. – Steven Lu Jun 13 '13 at 00:22
Doesn't work if `text` is part of a binary file's path. – Paddre Feb 17 '15 at 23:28
I suggest Popen(["file", "--mime", fn]. ...). Otherwise the word "text" might not appear. On my Linux, the answer for something that looks like a Fortran program is "FORTAN program". If you add the mime switch you get "text/x-fortran; charset=us-ascii". – Tsf Mar 13 '15 at 18:44
If you're using Python 3 the `msg` will be bytes rather than a string, so you'd have to use `return re.search("text", msg.decode()) != None` or `return "text" in msg.decode()` instead. – Matt Pitkin Apr 29 '20 at 21:48
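
Putting the suggestions from these comments together, a Python 3 version might look like the sketch below. It assumes a file(1) that supports `--brief` (so the file name is not echoed and cannot match by accident) and `--mime-type`, and Python 3.7+ for `capture_output`.

```python
import subprocess


def is_text(fn):
    """Ask file(1) for the MIME type and check for the text/ prefix."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", fn],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError if file(1) fails
    )
    return result.stdout.strip().startswith("text/")
```

Checking the `text/` prefix of the MIME type, rather than searching for the word "text" anywhere in the output, also covers descriptions that never mention the word.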
