From the file man page:
The type printed will usually contain one of the words text (the file contains only printing characters and a few common control characters and is probably safe to read on an ASCII terminal), executable (the file contains the result of compiling a program in a form understandable to some UNIX kernel or another), or data meaning anything else (data is usually "binary" or non-printable).
Since you only want to determine whether it's text or binary, I would just check whether every character in the stream is printable:
import string
all(c in string.printable for c in stream)
I don't think you will ever get this 100% right, but it should be reasonably accurate. Do you need to handle Unicode encodings, though?
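The one-liner above assumes a Python 2 byte string. Here is a rough Python 3 sketch of the same check, where iterating over bytes yields integers rather than characters. The names `PRINTABLE_BYTES`, `looks_like_text`, `file_looks_like_text` and the 4096-byte sample size are my own choices for illustration, not anything standard.

```python
import string

# Byte values of every character in string.printable (all of them are ASCII).
PRINTABLE_BYTES = frozenset(string.printable.encode("ascii"))

def looks_like_text(data):
    """Return True if every byte in `data` is a printable ASCII byte."""
    return all(b in PRINTABLE_BYTES for b in data)

def file_looks_like_text(path, sample_size=4096):
    """Apply the check to the first `sample_size` bytes of the file at `path`."""
    with open(path, "rb") as f:
        return looks_like_text(f.read(sample_size))
```

Sampling only the start of the file keeps the check cheap, at the cost of missing binary content that appears later on. Note that an empty input counts as text here, because `all()` of an empty sequence is true.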
EDIT - Unicode support is a little tricky, but if you have a set of possible encodings, you could try each one in turn until the document decodes successfully, and then check that all of the decoded characters are printable:
import string
import unicodedata

encodings = 'ascii', 'utf-8', 'utf-16'
test_strings = '\xf0\x01\x01\x00\x44', 'this is a test', 'a utf-8 test \xe2\x98\x83'

def attempt_decode(s, encodings):
    # Return the decoded text and the first encoding that accepts it.
    for enc in encodings:
        try:
            return s.decode(enc), enc
        except UnicodeDecodeError:
            pass
    # Nothing decoded cleanly, so treat the input as binary.
    return s, 'binary'

def printable(s):
    if isinstance(s, unicode):
        # Reject control characters (category Cc), but allow whitespace
        # such as tab and newline so ordinary multi-line text passes.
        return not any(unicodedata.category(c) == 'Cc' and c not in u'\t\n\r\x0b\x0c'
                       for c in s)
    return all(c in string.printable for c in s)

for s in test_strings:
    result, enc = attempt_decode(s, encodings)
    if enc != 'binary':
        if not printable(result):
            result, enc = s, 'binary'
    print enc + ' - ' + repr(result)
This results in:
binary - '\xf0\x01\x01\x00D'
ascii - u'this is a test'
utf-8 - u'a utf-8 test \u2603'
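The code above is Python 2 (it relies on `unicode` and the `print` statement). A Python 3 adaptation might look roughly like the sketch below. The `classify` wrapper and the `ENCODINGS` constant are names I made up, and allowing whitespace control characters such as tab and newline is my assumption about what should count as text.

```python
import string
import unicodedata

# Candidate encodings, tried in order; adjust to whatever you expect to see.
ENCODINGS = ("ascii", "utf-8", "utf-16")

def attempt_decode(data, encodings=ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes `data`."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            pass
    # Nothing decoded cleanly, so hand back the raw bytes.
    return data, "binary"

def printable(text):
    """True if `text` has no control characters other than whitespace."""
    return not any(
        unicodedata.category(c) == "Cc" and c not in string.whitespace
        for c in text
    )

def classify(data):
    """Return (text, encoding) for text, or (data, 'binary') otherwise."""
    result, enc = attempt_decode(data)
    if enc != "binary" and not printable(result):
        return data, "binary"
    return result, enc
```

Calling `classify(b'this is a test')` should give `('this is a test', 'ascii')`. One caveat: UTF-16 accepts most even-length byte strings, so keeping it last in the list (or dropping it) reduces the chance of binary data being reported as text.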