How to determine whether a stream is text or binary in Python?

Question

Is there a way to determine (test, check, or classify) whether a file (or a bytestream, or other file-like object) is text or binary, similar to the magic of the Unix file command, in a practical majority of cases?

Motivation: Although guesswork should be avoided, where Python can determine this, I'd like to use that capability. One could cover a useful number of cases and handle the exceptions.

Preference would be given to cross-platform or pure-Python methods. One option is python-magic; however, it depends on Cygwin on Windows, and on libmagic in general.

@user2357112 please prove me wrong but I think this is not detection; rather the one who opened the file sets it. — n611x007, Apr 10 '14 at 11:19

Peter Gibson · Answer 1 · 2014-04-10T22:26:53.043

From the file man page:

The type printed will usually contain one of the words text (the file contains only printing characters and a few common control characters and is probably safe to read on an ASCII terminal), executable (the file contains the result of compiling a program in a form understandable to some UNIX kernel or another), or data meaning anything else (data is usually ``binary'' or non-printable).

Seeing as you just want to determine whether it's text or binary, I would just check whether every character in the stream is printable:

import string
all(c in string.printable for c in stream)

I don't think you will ever be able to get this 100% right, but this should be reasonably accurate. Do you need to handle unicode encodings though?
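The one-liner above assumes Python 2, where iterating over a byte string yields one-character strings. A minimal Python 3 sketch of the same idea follows. It assumes the stream was opened in binary mode. The function name `looks_like_text` and the `sample_size` default are made up for illustration, and only a leading sample is inspected rather than the whole stream.

```python
import io
import string

# Byte values of the ASCII characters that string.printable contains.
PRINTABLE_BYTES = frozenset(string.printable.encode("ascii"))

def looks_like_text(stream, sample_size=1024):
    """Guess whether a binary file-like object holds printable ASCII.

    Hypothetical helper: sample_size is an arbitrary choice, and an
    empty stream is reported as text because nothing contradicts it.
    """
    chunk = stream.read(sample_size)
    # Iterating over bytes in Python 3 yields integers, hence the byte set.
    return all(byte in PRINTABLE_BYTES for byte in chunk)

text_result = looks_like_text(io.BytesIO(b"this is a test\n"))
binary_result = looks_like_text(io.BytesIO(b"\xf0\x01\x01\x00\x44"))
print(text_result, binary_result)
```

Reading only a sample keeps the check cheap on large files, at the cost of missing non-printable bytes that appear later.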

EDIT - Unicode support is a little tricky, but if you have a set of possible encodings, you could test whether the document decodes successfully with each one before checking that all of the characters are printable:

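# Python 2 code (byte-string literals, the unicode type, print statement)
import string
import unicodedata

# Candidate encodings, tried in this order
encodings = 'ascii', 'utf-8', 'utf-16'

test_strings = '\xf0\x01\x01\x00\x44', 'this is a test', 'a utf-8 test \xe2\x98\x83'

def attempt_decode(s, encodings):
    # Return the decoded text and the first encoding that accepts s
    for enc in encodings:
        try:
            return s.decode(enc), enc
        except UnicodeDecodeError:
            pass
    # No candidate encoding worked: treat the input as binary
    return s, 'binary'

def printable(s):
    if isinstance(s, unicode):
        # Decoded text: reject anything in the control category 'Cc'
        return not any(unicodedata.category(c) in ['Cc'] for c in s)
    return all(c in string.printable for c in s)

for s in test_strings:
    result, enc = attempt_decode(s, encodings)
    if enc != 'binary':
        # Decodable input with control characters still counts as binary
        if not printable(result):
            result, enc = s, 'binary'
    print enc + ' - ' + repr(result)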

This results in:

binary - '\xf0\x01\x01\x00D'
ascii - u'this is a test'
utf-8 - u'a utf-8 test \u2603'
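For readers on Python 3, the same decode-then-check idea might be sketched as follows. This is an adaptation, not the answer's original code. The names `classify`, `is_printable`, and `ALLOWED_CONTROLS` are hypothetical. Allowing tab, newline, and carriage return is an added assumption, since all three are in category 'Cc' and would otherwise make ordinary multi-line text count as binary.

```python
import unicodedata

ENCODINGS = ("ascii", "utf-8", "utf-16")

# Assumption for this sketch: common whitespace controls are acceptable in text.
ALLOWED_CONTROLS = "\t\n\r"

def is_printable(text):
    """True if text has no control characters other than the allowed ones."""
    return all(
        c in ALLOWED_CONTROLS or unicodedata.category(c) != "Cc"
        for c in text
    )

def classify(data, encodings=ENCODINGS):
    """Return (label, value): the first encoding that decodes data, or 'binary'."""
    for enc in encodings:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue
        # As in the code above, the first successful decode decides the outcome.
        if is_printable(text):
            return enc, text
        return "binary", data
    return "binary", data

for sample in (b"\xf0\x01\x01\x00\x44", b"this is a test",
               b"a utf-8 test \xe2\x98\x83"):
    label, value = classify(sample)
    print(label + " - " + repr(value))
```

As with the original, this is a heuristic: arbitrary binary data can happen to decode under one of the candidate encodings, so the list of encodings and the set of allowed control characters are worth tuning for the data at hand.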

well, I think the question's scope is OK to be ascii. practically utf8 would be preferable. — n611x007, Apr 09 '14 at 07:11
wow I forgot `unicodedata`, thank you! checked `.category` and it can return more than just 'Cc', here is what: http://www.unicode.org/reports/tr44/tr44-4.html#General_Category_Values — n611x007, Apr 10 '14 at 11:25
unclear: is the `in` inside `... in 'Cc'` useful? I think it will make a substring search but each possible [category](http://www.unicode.org/reports/tr44/tr44-4.html#General_Category_Values) seem to be a two-letter word. Although what about `... in ['Cc', ...]`? — n611x007, Apr 10 '14 at 11:29
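To illustrate the point raised in the last comment: on a string, `in` is a substring test, while on a list it is a membership test. For two-letter category codes the two agree, but the list form extends cleanly to several categories. A small sketch, where 'Cf' (format characters) is included only as an example of a second category:

```python
import unicodedata

category = unicodedata.category("\x00")  # NUL is a control character

# Substring test against a string.
in_string = category in "Cc"
partial_in_string = "C" in "Cc"  # a partial code also matches as a substring

# Membership test against a list of whole category codes.
in_list = category in ["Cc", "Cf"]
partial_in_list = "C" in ["Cc", "Cf"]

print(category, in_string, partial_in_string, in_list, partial_in_list)
```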
