20

How can I test whether a file that I'm opening in C# using FileStream is a "text type" file? I would like my program to open any file that is text based, for example, .txt, .html, etc.

It should not open files such as .doc, .pdf, or .exe.

Sev

6 Answers

11

In general: there is no way to tell.

A text file stored in UTF-16 will likely look like binary if you open it with an 8-bit encoding. Equally someone could save a text file as a .doc (it is a document).

You could open the file and inspect some of its content, but every such heuristic will sometimes fail. Notepad, for example, tries to do this, and a carefully chosen handful of characters will make it guess wrong and display completely different content.

If you have a specific scenario, rather than being able to open and process anything, you should be able to do much better.
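One narrow case that can be checked reliably is a file that starts with a byte order mark (BOM). The sketch below is only an illustration: the class and method names (`BomSniffer`, `DetectBom`) are made up for this example, and a file without a BOM simply yields `null`, which says nothing about whether it is text.

```csharp
using System;
using System.IO;

public static class BomSniffer
{
    // Returns a label for the BOM at the start of the data, or null if none is present.
    public static string DetectBom(byte[] head)
    {
        // UTF-32 LE is tested before UTF-16 LE because both start with FF FE.
        if (StartsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32 LE";
        if (StartsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32 BE";
        if (StartsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (StartsWith(head, 0xFF, 0xFE)) return "UTF-16 LE";
        if (StartsWith(head, 0xFE, 0xFF)) return "UTF-16 BE";
        return null;
    }

    // Convenience overload (hypothetical): reads up to four bytes from the file.
    public static string DetectBom(string filePath)
    {
        var head = new byte[4];
        using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
        {
            int read = stream.Read(head, 0, head.Length);
            Array.Resize(ref head, read);
        }
        return DetectBom(head);
    }

    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

Even this is a guess in principle: a UTF-16 LE file whose first character after the BOM is U+0000 is indistinguishable from UTF-32 LE by its first four bytes.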

Richard
9

I guess you could just check the first 1000 (an arbitrary number) characters and see whether any of them are unprintable, or whether they all fall within a certain ASCII range. If the latter, assume that it is text?

Whatever you do is going to be a guess.
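A minimal sketch of that idea, working on bytes rather than decoded characters. The names (`PrintableCheck`, `LooksLikeAsciiText`) and the choice of "printable" range (0x20 to 0x7E plus tab, LF, and CR) are assumptions made for this example, not something the answer specifies.

```csharp
using System;
using System.IO;

public static class PrintableCheck
{
    // True if each of the first maxBytes bytes is printable ASCII (0x20-0x7E)
    // or one of tab, line feed, carriage return. Empty input returns true.
    public static bool LooksLikeAsciiText(byte[] data, int maxBytes = 1000)
    {
        int limit = Math.Min(data.Length, maxBytes);
        for (int i = 0; i < limit; i++)
        {
            byte b = data[i];
            bool printable = b >= 0x20 && b <= 0x7E;
            bool whitespace = b == 0x09 || b == 0x0A || b == 0x0D;
            if (!printable && !whitespace) return false;
        }
        return true;
    }

    // Convenience overload (hypothetical): checks the start of a file.
    public static bool LooksLikeAsciiText(string filePath, int maxBytes = 1000)
    {
        var buffer = new byte[maxBytes];
        using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
        {
            int read = stream.Read(buffer, 0, buffer.Length);
            Array.Resize(ref buffer, read);
        }
        return LooksLikeAsciiText(buffer, maxBytes);
    }
}
```

Because it only accepts 7-bit ASCII, this rejects legitimate text containing non-ASCII characters, such as UTF-8 with accented letters or any UTF-16 file.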

Guy
  • and maybe check for regular intervals of spaces and newlines – Wouter Jan 20 '11 at 08:50
  • UPDATE: These days, a text file is more likely to be UTF-8, or a similar encoding of Unicode characters. Look for newer answers. For example, find out how Notepad++ decides a file is text. – ToolmakerSteve Apr 05 '19 at 10:46
  • A lot of applications check for NUL chars to determine if a file is binary. Git being one example. See my full answer below. – bytedev Jul 01 '21 at 01:04
8

As others have pointed out, there is no absolute way to be sure. However, to determine whether a file is binary (which is arguably easier than determining whether it is text), some implementations check for consecutive NUL characters. Git apparently just checks the first 8000 bytes for a NUL and, if it finds one, treats the file as binary. See here for more details.

Here is a similar C# solution I wrote that looks for a given number of consecutive NUL characters. If IsBinary returns false, your file is very likely text based.

public bool IsBinary(string filePath, int requiredConsecutiveNul = 1)
{
    const int charsToCheck = 8000;
    const char nulChar = '\0';

    int nulCount = 0;

    using (var streamReader = new StreamReader(filePath))
    {
        for (var i = 0; i < charsToCheck; i++)
        {
            if (streamReader.EndOfStream)
                return false;

            if ((char) streamReader.Read() == nulChar)
            {
                nulCount++;

                if (nulCount >= requiredConsecutiveNul)
                    return true;
            }
            else
            {
                nulCount = 0;
            }
        }
    }

    return false;
}
bytedev
3

To identify a file's real type, check its header (the signature bytes at the start of the file), which stays the same even if the extension is changed. You can get the header list here, and use something like this in your code:

using (var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
   using (var reader = new BinaryReader(stream))
   {
     // Read the first bytes of the file. ReadBytes returns fewer bytes than
     // requested if the file is shorter, instead of throwing.
     // This example checks for a BMP, whose header starts with
     // 0x42 0x4D (decimal 66 77, ASCII "BM").
     byte[] header = reader.ReadBytes(2);
     if (header.Length == 2 && header[0] == 0x42 && header[1] == 0x4D)
     {
        // it's probably a BMP file
     }
   }
}
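Since the header list link is no longer available, here is a hedged sketch of the same idea with a handful of widely documented signatures, including the formats the question wants to exclude. The names (`MagicNumbers`, `Identify`) are invented for this example, and the table is far from complete.

```csharp
using System;
using System.IO;

public static class MagicNumbers
{
    // A few well-known signatures, longest first. Illustrative only.
    private static readonly (string Name, byte[] Signature)[] Known =
    {
        ("PNG image", new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A }),
        ("OLE compound file (legacy .doc/.xls)", new byte[] { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 }),
        ("PDF", new byte[] { 0x25, 0x50, 0x44, 0x46 }),                        // "%PDF"
        ("ZIP container (.zip/.docx)", new byte[] { 0x50, 0x4B, 0x03, 0x04 }), // "PK.."
        ("Windows executable", new byte[] { 0x4D, 0x5A }),                     // "MZ"
        ("BMP image", new byte[] { 0x42, 0x4D }),                              // "BM"
    };

    // Returns the name of the first matching signature, or null if none matches.
    public static string Identify(byte[] head)
    {
        foreach (var (name, signature) in Known)
        {
            if (head.Length < signature.Length) continue;
            bool match = true;
            for (int i = 0; i < signature.Length; i++)
            {
                if (head[i] != signature[i]) { match = false; break; }
            }
            if (match) return name;
        }
        return null;
    }

    // Convenience overload (hypothetical): reads the first eight bytes of a file.
    public static string Identify(string filePath)
    {
        using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
        using (var reader = new BinaryReader(stream))
        {
            return Identify(reader.ReadBytes(8));
        }
    }
}
```

A `null` result only means no listed signature matched; it does not prove the file is text. Short signatures also produce false positives: a plain text file beginning with "BM" or "MZ" matches the two-byte entries.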
Amarus
Cheng Chen
  • I'm interested in your link to "here", but it's broken. Do you know what URL that was, or what the new one is? – JustBeingHelpful Feb 25 '12 at 19:19
  • @MacGyver: Sorry, the broken link is out of my control. I came across this solution in someone else's post. – Cheng Chen Feb 26 '12 at 10:09
  • This approach is flawed. There is no way to differentiate between a BMP file and a file that starts with the characters 'BM', which is more than likely to happen. – mafu Apr 27 '12 at 12:16
  • As a practical matter, this should be extended to examine more than just the first two characters, to decide the "likelihood" that a file lacks a header and is actually some other format. For example, if you read all bytes of the file and none of them has the sign bit set, it's highly likely to be ASCII text. Deciding whether it's actually a UTF-8 text file without a BOM is much trickier (not possible to do perfectly). – ToolmakerSteve Apr 05 '19 at 10:58
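The last comment can be sketched roughly as follows. `IsPureAscii` tests whether any byte has its high bit set, and `IsValidUtf8` uses .NET's `UTF8Encoding` with `throwOnInvalidBytes` set to true to see whether the bytes decode cleanly. The class and method names are hypothetical, and a successful decode is evidence of text, not proof.

```csharp
using System;
using System.Text;

public static class TextLikelihood
{
    // True if no byte has its high ("sign") bit set, i.e. every byte is 7-bit ASCII.
    public static bool IsPureAscii(byte[] data)
    {
        foreach (byte b in data)
        {
            if (b >= 0x80) return false;
        }
        return true;
    }

    // True if the bytes form a well-formed UTF-8 sequence.
    public static bool IsValidUtf8(byte[] data)
    {
        // The second argument makes the decoder throw on malformed input
        // instead of silently substituting U+FFFD.
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        try
        {
            strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

Both checks are meant to run on the whole file (for example via `File.ReadAllBytes`), because a prefix that cuts a multi-byte sequence in half would be reported as invalid. Binary data can also happen to be valid UTF-8, so combining this with a NUL check is a reasonable extra safeguard.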
0

The solution below works for me. It is a general check that treats a file as binary if it contains any byte outside the 7-bit ASCII range.

     /// <summary>
     /// Checks whether the given file is binary, i.e. contains any byte
     /// outside the 7-bit ASCII range (0-127). Note that non-ASCII text
     /// (for example UTF-8 with accented characters) is also reported as binary.
     /// </summary>
     public bool CheckForBinary(string filePath)
     {
             // 'using' closes the stream even if an exception is thrown.
             using (Stream objStream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
             {
                 int a;

                 // ReadByte returns -1 at the end of the stream.
                 while ((a = objStream.ReadByte()) != -1)
                 {
                     if (a > 127)
                     {
                         return true;      // Binary File
                     }
                 }
             }

             return false;                 // Text File (an empty file also counts as text)
     }
Neel Maheta
-1
public bool IsTextFile(string FilePath)
{
  using (StreamReader reader = new StreamReader(FilePath))
  {
       int Character;
       while ((Character = reader.Read()) != -1)
       {
           if ((Character > 0 && Character < 8) || (Character > 13 && Character < 26))
           {
                    return false; 
           }
       }
  }
  return true;
}
Hero