How do you identify the file content as being in ASCII or binary using C++?
- Other question is better, so suggest closing this one. Duplicates http://stackoverflow.com/questions/567757/how-do-i-distinguish-between-binary-and-text-files – Mechanical snail Aug 06 '12 at 07:22
11 Answers
If a file contains only the decimal bytes 9–13, 32–126, it's probably a pure ASCII text file. Otherwise, it's not. However, it may still be text in another encoding.
If, in addition to the above bytes, the file contains only the decimal bytes 128–255, it's probably a text file in an 8-bit or variable-length ASCII-based encoding such as ISO-8859-1, UTF-8 or ASCII+Big5. If not, for some purposes you may be able to stop here and consider the file to be binary. However, it may still be text in a 16- or 32-bit encoding.
If a file doesn't meet the above constraints, examine the first 2–4 bytes of the file for a byte-order mark:
- If the first two bytes are hex FE FF, the file is tentatively UTF-16 BE.
- If the first two bytes are hex FF FE, and the following two bytes are not hex 00 00, the file is tentatively UTF-16 LE.
- If the first four bytes are hex 00 00 FE FF, the file is tentatively UTF-32 BE.
- If the first four bytes are hex FF FE 00 00, the file is tentatively UTF-32 LE.
If, through the above checks, you have determined a tentative encoding, then check only for the corresponding encoding below, to ensure that the file is not a binary file which happens to match a byte-order mark.
If you have not determined a tentative encoding, the file might still be a text file in one of these encodings, since the byte-order mark is not mandatory, so check for all encodings in the following list:
- If the file contains only big-endian two-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-16 BE.
- If the file contains only little-endian two-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-16 LE.
- If the file contains only big-endian four-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-32 BE.
- If the file contains only little-endian four-byte words with the decimal values 9–13, 32–126, and 128 or above, the file is probably UTF-32 LE.
If, after all these checks, you still haven't determined an encoding, the file isn't a text file in any ASCII-based encoding I know about, so for most purposes you can probably consider it to be binary (it might still be a text file in a non-ASCII encoding such as EBCDIC, but I suspect that's well outside the scope of your concern).
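For illustration, here is a minimal sketch of the first check (bytes 9–13 and 32–126 only) and the BOM probe, assuming the file has already been read into a byte buffer; the function names and structure are my own, not part of any standard API:
#include <string>
#include <vector>

// Sketch of step 1: true if every byte is in 9-13 or 32-126 (pure ASCII text).
bool looksLikePlainAscii(const std::vector<unsigned char>& bytes)
{
    for (unsigned char b : bytes) {
        bool ok = (b >= 9 && b <= 13) || (b >= 32 && b <= 126);
        if (!ok)
            return false;
    }
    return true;
}

// Sketch of the BOM probe: returns a tentative encoding name, or "" if no BOM matches.
// The UTF-32 marks are tested first because FF FE (UTF-16 LE) is also a prefix of
// the UTF-32 LE mark FF FE 00 00.
std::string tentativeEncodingFromBom(const std::vector<unsigned char>& b)
{
    if (b.size() >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
        return "UTF-32 BE";
    if (b.size() >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
        return "UTF-32 LE";
    if (b.size() >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return "UTF-16 BE";
    if (b.size() >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return "UTF-16 LE";
    return "";
}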

- This only works if the text is ASCII. If UTF16 or UTF32, then it may contain bytes with values 0-8, 14-31 and 127. Your answer is therefore confusing. – David Arno Nov 10 '08 at 11:03
- @David Arno, That's true, but the question was actually about ASCII or not. – quinmars Nov 10 '08 at 12:33
- @quinmars, I draw your attention to the first line of this answer "I assume you really want to detect if a file is text (in any encoding), not just ASCII.". Given that, the second line is plain wrong. Thus the answer is confused and misleading. – David Arno Nov 10 '08 at 12:43
- @David Arno: I agree, so I've edited my answer to reflect your comments. Thanks :). – Daniel Cassidy Nov 10 '08 at 14:55
- Sorry Daniel, but the system won't let me undo my downvote, which is ridiculous as you've edited it to make it a really good answer :( – David Arno Nov 10 '08 at 20:51
You iterate through it using a normal loop with stream.get(), and check whether the byte values you read are <= 127. One of many ways to do it:
#include <cstdio>   // for EOF
#include <fstream>

int c;
std::ifstream a("file.txt", std::ios::binary); // binary mode: read raw bytes
while ((c = a.get()) != EOF && c <= 127)
    ; // keep reading while every byte is in the ASCII range
if (c == EOF) {
    /* file is all ASCII */
}
However, as someone mentioned, all files are binary files after all. Additionally, it's not clear what you mean by "ASCII". If you mean the character code, then this is indeed the way to go. But if you mean only alphanumeric values, you would need another approach.

- I don't think that is what the author intended. But *factually* this is the correct answer. :-) – Tomalak Nov 10 '08 at 10:32
- It is the correct answer to the question asked. However Tomalak you are right in that san probably hasn't phrased the question properly. – David Arno Nov 10 '08 at 10:34
- I think the expression "ASCII or binary" is a hint that he really means "text, as opposed to binary". – Tomalak Nov 10 '08 at 10:46
- yes. maybe he wanted that. but maybe he also wants to have '[' included... one never knows :) – Johannes Schaub - litb Nov 10 '08 at 11:11
- This code will mark as ASCII some files that aren't. You want to check 'c' as follows: c >= ' ' && c <= 127 || c == '\n' || c == '\r' || c == '\t'. Adjust according to your idea of ASCII and your platform. Some possible additions would be '\f' and '\v'. – Nov 10 '08 at 15:19
- "man ascii" shows me 128 ascii characters, including values below ' ' (0x20) – Johannes Schaub - litb Nov 10 '08 at 15:57
- @Arkadiy, ASCII covers 0 - 127, which includes all of the control characters. ASCII is NOT just the printable characters. – David Arno Nov 10 '08 at 20:54
- @Cedrik well it is not standard ASCII, ASCII is the American Standard Code for Information Interchange after all :(. If you also include values > 128 in your test it makes every file a text file (as every file is composed of bytes... between 0 and 255). – Ben Apr 14 '09 at 19:34
My text editor decides based on the presence of null bytes. In practice, that works really well: a binary file with no null bytes is extremely rare.
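For example, a minimal sketch of that heuristic, scanning only the first few kilobytes (as the gnu diff comment below notes) rather than the whole file; the buffer size is an arbitrary value of my choosing:
#include <fstream>

// Sketch: treat the file as binary if a NUL byte appears in the first 8000 bytes.
// The 8000-byte limit is an arbitrary example value.
bool looksLikeTextNoNulBytes(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    char buf[8000];
    in.read(buf, sizeof buf);
    std::streamsize n = in.gcount();
    for (std::streamsize i = 0; i < n; ++i)
        if (buf[i] == '\0')
            return false; // NUL found: probably binary
    return true;
}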

- This is what gnu diff does as well. Except they only look at a predefined length into the file. (Don't want to skim a 4GB file for null bytes...) – Bill Lynch Apr 14 '09 at 19:24
The contents of every file are binary. So, knowing nothing else, you can't be sure.
ASCII is a matter of interpretation. If you open a binary file in a text editor, you'll see what I mean.
Most binary files contain a fixed header (per type) you can look for, or you can take the file extension as a hint. You can look for byte order marks if you expect UTF-encoded files, but they are optional as well.
Unless you define your question more closely, there can't be a definitive answer.
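To illustrate the fixed-header idea, here is a small sketch that checks for one well-known magic number (the 8-byte PNG signature); the function itself is just an example, not a general detector:
#include <cstring>
#include <fstream>

// Sketch: true if the file starts with the standard 8-byte PNG signature.
bool hasPngMagic(const char* path)
{
    static const unsigned char png[8] = { 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A };
    unsigned char head[8] = { 0 };
    std::ifstream in(path, std::ios::binary);
    in.read(reinterpret_cast<char*>(head), sizeof head);
    return in.gcount() == 8 && std::memcmp(head, png, sizeof png) == 0;
}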

Have a look at how the file command works; it has three strategies to determine the type of a file:
- filesystem tests
- magic number tests
- and language tests
Depending on your platform, and the possible files you're interested in, you can look at its implementation, or even invoke it.
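If simply invoking it is acceptable, a rough sketch might look like the following (POSIX-only, using popen; the path is not shell-escaped here, so treat this purely as an illustration):
#include <cstdio>
#include <string>

// Sketch: run `file --brief` on the path and return its one-line description.
// Assumes a POSIX system with the file(1) utility on the PATH.
std::string fileTypeDescription(const std::string& path)
{
    std::string cmd = "file --brief \"" + path + "\"";
    std::string result;
    if (FILE* p = popen(cmd.c_str(), "r")) {
        char buf[256];
        while (fgets(buf, sizeof buf, p))
            result += buf;
        pclose(p);
    }
    return result;
}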

If the question is genuinely how to detect just ASCII, then litb's answer is spot on. However if san was after knowing how to determine whether the file contains text or not, then the issue becomes way more complex. ASCII is just one - increasingly unpopular - way of representing text. Unicode systems - UTF16, UTF32 and UTF8 - have grown in popularity. In theory, they can be easily tested for by checking if the first two bytes are the Unicode byte order mark (BOM) 0xFEFF (or 0xFFFE if the byte order is reversed). However, as those two bytes screw up many file formats for Linux systems, they cannot be guaranteed to be there. Further, a binary file might start with 0xFEFF.
Looking for 0x00's (or other control characters) won't help either if the file is Unicode. If the file is UTF16, say, and it contains English text, then every other byte will be 0x00.
If you know the language that the text file will be written in, then it would be possible to analyse the bytes and statistically determine if it contains text or not. For example, the most common letter in English is E, followed by T. So if the file contains lots more E's and T's than Z's and X's, it's likely text. Of course it would be necessary to test this as ASCII and as the various Unicode encodings to make sure.
If the file isn't written in English - or you want to support multiple languages - then the only two options left are to look at the file extension on Windows and to check the first four bytes against a database of "magic file" codes to determine the file's type and thus whether it contains text or not.
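As a toy sketch of the statistical idea for a byte-oriented encoding (the chosen letters and the ratio threshold are invented for this example):
#include <cctype>
#include <string>

// Sketch: crude English-text heuristic over a byte buffer.
// Counts common letters (e, t) versus rare ones (z, x); the 10:1 threshold is arbitrary.
bool looksLikeEnglishText(const std::string& bytes)
{
    int common = 0, rare = 0;
    for (unsigned char b : bytes) {
        int c = std::tolower(b);
        if (c == 'e' || c == 't') ++common;
        if (c == 'z' || c == 'x') ++rare;
    }
    return common > 0 && common > 10 * rare;
}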

Well, this depends on your definition of ASCII. You can either check for values with ASCII code < 128 or for some character set you define (e.g. 'a'-'z', 'A'-'Z', '0'-'9', ...) and treat the file as binary if it contains some other characters.
You could also check for regular line breaks (0x0A, or the pair 0x0D 0x0A) to detect text files.
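A sketch of the character-set variant (the accepted set below is only one example of "some charset you define"):
#include <cctype>
#include <cstdio>
#include <fstream>

// Sketch: accept alphanumerics, punctuation, space, and the usual tab/line-break bytes;
// anything else makes the file count as binary.
bool matchesTextWhitelist(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    int c;
    while ((c = in.get()) != EOF) {
        if (!(std::isalnum(c) || std::ispunct(c) ||
              c == ' ' || c == '\n' || c == '\r' || c == '\t'))
            return false;
    }
    return true;
}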

To check, you must open the file as binary. You can't open the file as text. ASCII is effectively a subset of binary. After that, you must check the byte values. ASCII has byte values 0-127, but 0-31 are control characters. TAB, CR and LF are the only common control characters. You can't (portably) use 'A' and 'Z'; there's no guarantee those are in ASCII (!). If you need them, you'll have to define them yourself:
const unsigned char ASCII_A = 0x41; // NOT 'A'
const unsigned char ASCII_Z = ASCII_A + 25;
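Putting those points together, a minimal sketch might look like this (binary open, printable ASCII only, plus TAB, CR and LF; the function name is my own):
#include <cstdio>
#include <fstream>

// Sketch: true if every byte is printable ASCII (32-126) or TAB/CR/LF.
bool isAsciiText(const char* path)
{
    std::ifstream in(path, std::ios::binary); // must be opened as binary
    int c;
    while ((c = in.get()) != EOF) {
        bool ok = (c >= 32 && c <= 126) || c == '\t' || c == '\r' || c == '\n';
        if (!ok)
            return false;
    }
    return true;
}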

This question really has no right or wrong answer to it, just complex solutions that will not work for all possible text files.
Here is a link to a The Old New Thing article on how Notepad detects the type of an ASCII file. It's not perfect, but it's interesting to see how Microsoft handles it.

GitHub's Linguist uses the charlock_holmes library to detect binary files, which in turn uses ICU's charset detection.
The ICU library is available for many programming languages, including C and Java.
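If pulling in ICU is an option, its C charset-detection API (unicode/ucsdet.h) can be called directly; here is a rough sketch with error handling trimmed, and the exact calls should be checked against your ICU version:
#include <string>
#include <unicode/ucsdet.h>

// Sketch: ask ICU's charset detector for its best guess on a byte buffer.
// Returns the detected charset name, or an empty string on failure.
std::string detectCharsetWithIcu(const std::string& bytes)
{
    UErrorCode status = U_ZERO_ERROR;
    UCharsetDetector* det = ucsdet_open(&status);
    if (U_FAILURE(status))
        return "";
    ucsdet_setText(det, bytes.data(), static_cast<int32_t>(bytes.size()), &status);
    const UCharsetMatch* match = ucsdet_detect(det, &status);
    std::string name;
    if (U_SUCCESS(status) && match)
        name = ucsdet_getName(match, &status);
    ucsdet_close(det);
    return name;
}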
#include <cstdio>   // for EOF
#include <fstream>
#include <string>

bool checkFileASCIIFormat(const std::string& fileName)
{
    // Open in binary mode so bytes are read unmodified.
    std::ifstream read(fileName, std::ios::binary);
    int c;
    while ((c = read.get()) != EOF) {
        if (c > 127) {
            // ASCII codes only go up to 127
            return false;
        }
    }
    return true;
}
