3

Possible Duplicate:
How can I determine if a file is binary or text in c#?

Without considering the filename (the extension), using only the content, we need to know whether a file is text or binary. I can't use the extension because I don't know all the text file extensions, and because a text file can have no extension at all.

I was doing it by looking at the percentage of non-ASCII bytes in the first part of the file. I cannot read the full file each time for performance reasons. I was using the following code:

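private static bool IsBinary(byte[] bytes, int maxLength)
{
    // Inspect at most the first 1024 bytes, and never more than the buffer holds.
    int len = maxLength > 1024 ? 1024 : maxLength;
    if (len > bytes.Length)
        len = bytes.Length;
    if (len <= 0)
        return false; // nothing to inspect; an empty sample is treated as text

    int nonASCIIcount = 0;

    for (int i = 0; i < len; ++i)
        if (bytes[i] > 127)
            ++nonASCIIcount;

    // If more than 30% of the sampled bytes are non-ASCII,
    // treat the file as binary. The cast avoids integer division,
    // which would otherwise truncate the ratio to 0 or 1.
    return ((double)nonASCIIcount / len) > 0.3;
}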

The problem is that some kinds of files, such as Photoshop files, are wrongly detected as text because the first part of the file is text.

Any suggestions?

Borja
  • you could do a random sampling throughout the file to see if each is an allowed text character. – Greg Bogumil Jan 21 '11 at 14:00
  • This was already discussed in this Thread: http://stackoverflow.com/questions/910873/how-can-i-determine-if-a-file-is-binary-or-text-in-c – fjdumont Jan 21 '11 at 14:01
  • What about UTF-8 encoded text files? Do you want to consider those as well? – Darin Dimitrov Jan 21 '11 at 14:02
  • If the auto-suggestion system didn't find *that* duplicate, it's obviously broken. – Cody Gray - on strike Jan 21 '11 at 14:03
  • How much text might be in the files? If there is a lot, maybe you could try and convert the byte array to ASCII text using `System.Text.Encoding.ASCII.GetString` and search for a word (like 'the' or something). – SwDevMan81 Jan 21 '11 at 14:03
  • Plain text != ASCII. Or can you be sure this code will never see any text except ASCII? –  Jan 21 '11 at 14:06
  • I think text files in the code pages used for Russian or Chinese text are very hard to detect, since they contain many characters > 127. – CodesInChaos Jan 21 '11 at 14:06
  • And how do you define binary vs text? Like FTP or do you want to know if you can display it as text? – CodesInChaos Jan 21 '11 at 14:11
  • I want to know if a file is a text file although the file is not an ASCII file. – Borja Jan 21 '11 at 14:16

3 Answers

2

You cannot decide that a file is text based on a percentage. The only way is to check whether ANY byte is non-ASCII; if so, treat the file as binary. So your code should be:

static bool IsBinary(byte[] bytes)
{
  // Any byte above 127 is outside the 7-bit ASCII range.
  for (int i = 0; i < bytes.Length; i++)
    if (bytes[i] > 127)
      return true;
  return false;
}

EDIT: Also, maybe you should have a look at the MIME type of the file, if it is available to you.

Migol
  • I need to support the non-ASCII files, so the file can have a byte bigger than 127 – Borja Jan 21 '11 at 14:21
  • Then there is NO way to make sure that file is or is not a binary file because non-ASCII files are in fact binary files. – Migol Jan 21 '11 at 16:09
1

It depends on the content and probable text encoding of the whole file; anything else is not reliable. Also, you shouldn't check for > 127 but instead for < 32 (0x20), excluding 0x0a and 0x0d (line feed and carriage return), for plain ASCII files. If the encoding might be UTF-8, it's more complex: it might work to try to read the file as UTF-8 and, if that fails, treat it as binary.
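
A minimal sketch of the two checks described above might look like the following. The class and method names (`TextSniffer`, `LooksLikeText`) are hypothetical, and allowing tab (0x09) is an added assumption that the answer does not state. The strict decode relies on the `throwOnInvalidBytes` argument of the `UTF8Encoding` constructor, which makes `GetString` throw a `DecoderFallbackException` on malformed input. Note that a sample cut off in the middle of a multi-byte sequence would also fail the decode, and that text in legacy code pages (for example Windows-1251) would be reported as binary.

```csharp
using System;
using System.Text;

public static class TextSniffer
{
    // Hypothetical helper: returns true if the sample looks like ASCII or UTF-8 text.
    public static bool LooksLikeText(byte[] sample)
    {
        // Step 1: reject control bytes below 0x20, except LF and CR.
        // Tab (0x09) is also allowed here; that is an assumption of this sketch.
        foreach (byte b in sample)
        {
            if (b < 0x20 && b != 0x0A && b != 0x0D && b != 0x09)
                return false;
        }

        // Step 2: strict UTF-8 decode. With throwOnInvalidBytes set to true,
        // malformed byte sequences raise DecoderFallbackException.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetString(sample);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false; // not valid UTF-8, so treat the sample as binary
        }
    }
}
```

Called on the first block read from a file, for example `TextSniffer.LooksLikeText(buffer)`, this treats a Photoshop-style header containing a NUL byte as binary even when most of the sample is printable.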

-1

You don't want to include "control characters" as text data. Text files never include characters whose ASCII code is less than 32.

kynnysmatto