Python, file(1) - Why are the numbers [7,8,9,10,12,13,27] and range(0x20, 0x100) used for determining text vs binary file

Question

Regarding a solution for determining whether a file is binary or text in python, the answerer uses:

textchars = bytearray([7,8,9,10,12,13,27]) + bytearray(range(0x20, 0x100))

and then uses .translate(None, textchars) to remove (or replace by nothing) all such characters in a file read in as binary.
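A minimal sketch of that check, assuming the data has already been read in binary mode. The function name `is_binary_string` is chosen here for illustration:

```python
# Byte values treated as "text": seven control codes plus 0x20-0xFF.
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))


def is_binary_string(data: bytes) -> bool:
    """Return True if any byte in data falls outside textchars."""
    # translate(None, delete) removes every byte listed in delete,
    # so anything left over is a byte not considered text.
    return bool(data.translate(None, textchars))


print(is_binary_string(b"hello, world\n"))  # False
print(is_binary_string(b"\x00\x01\x02"))    # True
```

In practice one would typically pass only a leading chunk of the file, for example `open(path, "rb").read(1024)`, where `path` and the chunk size are assumptions of this sketch.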

The answerer also argues that this choice of numbers is "based on file(1) behaviour" (for what's text and what's not). What is so significant about these numbers in determining text files from binary?

Martijn Pieters · Accepted Answer · 2015-08-24T15:51:41.730

They represent the most common codepoints for printable text, plus newlines, spaces, carriage returns and the like. ASCII is covered up to 0x7F, and standards like Latin-1 or Windows Codepage 1251 use the remaining 128 bytes for accented characters, etc.

You'd expect text to only use those codepoints. Binary data would use all codepoints in the range 0x00-0xFF; e.g. a text file will probably not use \x00 (NUL) or \x1F (Unit Separator in the ASCII standard).

It is a heuristic at best, though. Some text files may still try and use C0 control codes outside those 7 characters explicitly named, and I'm sure binary data exists that happens to not include the 25 byte values not included in the textchars string.
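To make the limits of the heuristic concrete, here is a small sketch that lists the excluded byte values and shows one misclassification in each direction. The sample inputs are made up for illustration:

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# Byte values the heuristic treats as evidence of binary data.
excluded = [b for b in range(0x100) if b not in textchars]
print(len(excluded))  # 25
print([hex(b) for b in excluded])

# Text misclassified as binary: a vertical tab (0x0B) is not in textchars.
text_with_vt = b"left\x0bright\n"
print(bool(text_with_vt.translate(None, textchars)))  # True

# Opaque data that passes as text: every byte is in 0x80-0xFF.
opaque = bytes(range(0x80, 0x100))
print(bool(opaque.translate(None, textchars)))  # False
```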

The author of the range probably based it on the text_chars table from the file command. It marks bytes as non-text, ASCII, Latin-1 or non-ISO extended ASCII, and includes documentation on why those codepoints were chosen:

/*
 * This table reflects a particular philosophy about what constitutes
 * "text," and there is room for disagreement about it.
 *
 * [....]
 *
 * The table below considers a file to be ASCII if all of its characters
 * are either ASCII printing characters (again, according to the X3.4
 * standard, not isascii()) or any of the following controls: bell,
 * backspace, tab, line feed, form feed, carriage return, esc, nextline.
 *
 * I include bell because some programs (particularly shell scripts)
 * use it literally, even though it is rare in normal text.  I exclude
 * vertical tab because it never seems to be used in real text.  I also
 * include, with hesitation, the X3.64/ECMA-43 control nextline (0x85),
 * because that's what the dd EBCDIC->ASCII table maps the EBCDIC newline
 * character to.  It might be more appropriate to include it in the 8859
 * set instead of the ASCII set, but it's got to be included in *something*
 * we recognize or EBCDIC files aren't going to be considered textual.
 *
 * [.....]
 */

Interestingly enough, that table excludes 0x7F, which the code you found does not.
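The difference is easy to check. The stricter variant below is only a sketch of how one might drop 0x7F (DEL) to follow the table more closely, and the name `strict_textchars` is hypothetical:

```python
textchars = bytearray([7, 8, 9, 10, 12, 13, 27]) + bytearray(range(0x20, 0x100))

# range(0x20, 0x100) includes 0x7F, so DEL counts as text here.
print(0x7F in textchars)  # True

# Hypothetical stricter set that leaves DEL out.
strict_textchars = bytearray(b for b in textchars if b != 0x7F)

print(bool(b"abc\x7f".translate(None, textchars)))         # False
print(bool(b"abc\x7f".translate(None, strict_textchars)))  # True
```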

Indeed. It's still an assumption, and can yield both false positives and false negatives (binary data that happens to have only bytes within text range, or text data that happens to use esoteric characters). — spectras, Aug 24 '15 at 14:34
@dan04: I'm guessing that if this list of control codes really comes from `file(1)` then they are based on some statistics on text files, which would mean that vertical tabs are not used often enough in text files to make it a significant indicator. — Martijn Pieters, Aug 24 '15 at 14:42
@dan04: not that I have been able to locate this in the [file source code](https://github.com/file/file) yet. — Martijn Pieters, Aug 24 '15 at 14:44
@MartijnPieters Thanks for your help. Regarding the file(1) source code, can we see where it determines whether a file is text or binary? I've had a look into [encoding.c](https://github.com/file/file/blob/master/src/encoding.c) but being a novice in C I'm struggling to find anything. — andyandy, Aug 24 '15 at 15:41
@andyandy: I ran out of time earlier, but you found the right file. Basically [this table](https://github.com/file/file/blob/master/src/encoding.c#L208-L228) has those exact same 7 control codes marked as *appears in ASCII text*, and hex 0x20 through to 0xFF all are set to one of the ASCII, Latin1 and non-ISO extended flags. — Martijn Pieters, Aug 24 '15 at 15:46
@MartijnPieters - Thank you sooooo much! Clearly the problem of determining what constitutes text is not a trivial problem (although I like your idea using statistics on text files). — andyandy, Aug 24 '15 at 16:15
