61

I need to determine, with about 80% certainty, whether a file is binary or text. Is there any way to do it, even quick and dirty/ugly, in C#?

Pablo Retyk

11 Answers

35

There's a technique based on Markov chains. Scan a few model files of both kinds and, for each byte value from 0 to 255, gather statistics (essentially the probability) of the subsequent value. This gives you a 64 KB (256×256) profile you can compare your runtime files against (within a % threshold).

Supposedly, this is how browsers' Auto-Detect Encoding feature works.
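
A minimal sketch of that idea in C# (the normalization, the absolute-difference distance, and nearest-model classification are my assumptions; the answer only specifies the 256×256 profile):

// Requires: using System; using System.IO;
// Build a 256x256 byte-transition profile (the "Markov chain") for a file.
static double[,] BuildProfile(string path)
{
    var profile = new double[256, 256];
    byte[] data = File.ReadAllBytes(path);
    for (int i = 1; i < data.Length; i++)
        profile[data[i - 1], data[i]]++;

    // Normalize so files of different sizes are comparable.
    double total = Math.Max(1, data.Length - 1);
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            profile[a, b] /= total;
    return profile;
}

// Sum of absolute differences between two profiles; smaller means more alike.
static double Distance(double[,] x, double[,] y)
{
    double d = 0;
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            d += Math.Abs(x[a, b] - y[a, b]);
    return d;
}

// Classify a file by whichever model profile it sits closer to.
static bool LooksLikeText(string path, double[,] textModel, double[,] binaryModel)
{
    var p = BuildProfile(path);
    return Distance(p, textModel) < Distance(p, binaryModel);
}

The model profiles would be built once from known text and binary sample files, e.g. by averaging several BuildProfile results.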

Andriy Volkov
25

I would probably look for an abundance of control characters, which would typically be present in a binary file but rarely in a text file. Binary files tend to use 0 enough that just testing for many 0 bytes would probably be sufficient to catch most files. If you care about localization, you'd need to test multi-byte patterns as well.

As stated though, you can always be unlucky and get a binary file that looks like text or vice versa.
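
A rough sketch of that heuristic (the 4 KB sample size and the 10% threshold are assumptions to tune against your own files):

// Requires: using System.IO;
static bool LooksBinary(string path)
{
    // Sample the start of the file; binary formats usually show control
    // bytes early (headers, length fields, padding).
    byte[] buffer = new byte[4096];
    int read;
    using (var fs = File.OpenRead(path))
        read = fs.Read(buffer, 0, buffer.Length);

    int suspicious = 0;
    for (int i = 0; i < read; i++)
    {
        byte b = buffer[i];
        // Count NUL and other control bytes, excluding tab, LF, and CR.
        if (b < 32 && b != 9 && b != 10 && b != 13)
            suspicious++;
    }

    // Arbitrary threshold: more than 10% control bytes looks binary.
    return read > 0 && suspicious * 10 > read;
}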

Ron Warholic
    Thanks, I looked for 4 consecutive nulls ("\0\0\0\0"); binary files seem to have a lot of them, so I tested it on 50 random files and it works. – Pablo Retyk May 26 '09 at 14:42
  • 2
    Four consecutive nulls failed to say some .png files were binary so I tried two consecutive nulls and that worked better. – Adam Bruss Oct 29 '12 at 15:55
  • 11
    If the text file is ASCII or UTF-8, finding _one_ zero byte should be enough to conclude it's not. This will fail for UTF-16 and UTF-32 files, but so will most text editors ;-) – John Dvorak Mar 15 '13 at 08:41
18

Sharing my solution in the hope that it helps others, as these posts and forums have helped me.

Background

I had been researching and exploring a solution for the same problem, expecting it to be simple or only slightly twisted.

However, most of the attempts here and in other sources offer convoluted solutions that dive into Unicode, the UTF series, BOMs, encodings, and byte orders. In the process, I also went off-road into ASCII tables and code pages.

Anyway, I came up with a solution based on the idea of a StreamReader and a custom control-character check.

It is built taking into consideration various hints and tips provided on this forum and elsewhere, such as:

  1. Check for a lot of control characters, for example by looking for multiple consecutive null characters.
  2. Check for UTF, Unicode, encodings, BOMs, byte orders, and similar aspects.

My goals were:

  1. It should not rely on byte orders, encodings, and other esoteric details.
  2. It should be relatively easy to implement and easy to understand.
  3. It should work on all types of files.

The solution presented works for me on test data that includes mp3, eml, txt, info, flv, mp4, pdf, gif, png, jpg. It gives results as expected so far.

How the solution works

I rely on the StreamReader default constructor to do what it does best with respect to determining encoding-related characteristics of a file; it uses UTF8Encoding by default.

I created my own version of a custom control-character check because Char.IsControl does not seem useful. The documentation says:

Control characters are formatting and other non-printing characters, such as ACK, BEL, CR, FF, LF, and VT. The Unicode standard assigns code points from \U0000 to \U001F, \U007F, and from \U0080 to \U009F to control characters. These values are to be interpreted as control characters unless their use is otherwise defined by an application.

So it considers CR and LF control characters, among other things, which makes it unsuitable here, since text files contain at least CR and LF.

Solution

// Requires: using System; using System.Collections.Generic; using System.IO;
// using System.Linq; and a reference to System.Windows.Forms (for Clipboard).
static void testBinaryFile(string folderPath)
{
    List<string> output = new List<string>();
    foreach (string filePath in getFiles(folderPath, true))
    {
        output.Add(isBinary(filePath).ToString() + "  ----  " + filePath);
    }
    Clipboard.SetText(string.Join("\n", output), TextDataFormat.Text);
}

public static List<string> getFiles(string path, bool recursive = false)
{
    return Directory.Exists(path) ?
        Directory.GetFiles(path, "*.*",
        recursive ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly).ToList() :
        new List<string>();
}    

public static bool isBinary(string path)
{
    long length = getSize(path);
    if (length == 0) return false;

    using (StreamReader stream = new StreamReader(path))
    {
        int ch;
        while ((ch = stream.Read()) != -1)
        {
            if (isControlChar(ch))
            {
                return true;
            }
        }
    }
    return false;
}

public static bool isControlChar(int ch)
{
    return (ch > Chars.NUL && ch < Chars.BS)
        || (ch > Chars.CR && ch < Chars.SUB);
}

public static class Chars
{
    public static char NUL = (char)0;  // Null char
    public static char BS = (char)8;   // Back Space
    public static char CR = (char)13;  // Carriage Return
    public static char SUB = (char)26; // Substitute
}

// getSize was missing from the original post; as the comments below note,
// FileInfo.Length serves the purpose.
public static long getSize(string path)
{
    return new FileInfo(path).Length;
}

If you try the above solution, let me know whether it works for you.

bhavik shah
  • The getSize function is missing. Thanks for the code. The important bits were used and testing so far seems to go well. – Atron Seige Sep 17 '15 at 05:06
  • I actually like that this solution does not read the entire file. It makes it much easier to run a tool that observes an entire directory which may contain 50-MB videos. – Katana314 Dec 04 '15 at 17:26
  • @AtronSeige you can use `new FileInfo(path).Length` to get the file size. – Jeremy Cook Nov 21 '16 at 15:37
  • It helps to confirm the encoding too. I wrote a tool that confirms the encoding, using your solution: https://marketplace.visualstudio.com/items?itemName=lindexigd.vs-extension-18109 – lindexi Jan 19 '17 at 03:11
  • Thanks. Worked except in one case: I took an XML file, opened it in Notepad, and saved it as Unicode (also adding some foreign characters). I'm storing the file in a blob or text field of a MySQL column, then later writing it back to disk. – NealWalters Apr 14 '17 at 15:34
  • Thanks, it worked for the problem I was having. Files are being saved to a network drive and occasionally get filled with all null characters. – Aaron Aug 03 '17 at 04:29
15

While this isn't foolproof, this checks whether a string has any binary content:

public bool HasBinaryContent(string content)
{
    // Tab is a control character too, but common in text files,
    // so allow it along with \r and \n.
    return content.Any(ch => char.IsControl(ch) && ch != '\r' && ch != '\n' && ch != '\t');
}

If any control character exists (aside from the standard \r, \n, and tab), then it is probably not a text file.
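
A possible call site (a sketch; decoding the raw bytes as ISO-8859-1 is my assumption, chosen because it maps every byte to a char without throwing):

// HasBinaryContent's .Any(...) needs a `using System.Linq;` directive.
string content = Encoding.GetEncoding("ISO-8859-1")
                         .GetString(File.ReadAllBytes(path));
bool isBinaryFile = HasBinaryContent(content);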

McKay
10

If the real question here is "Can this file be read and written using StreamReader/StreamWriter without modification?", then the answer is here:

/// <summary>
/// Detect if a file is text and detect the encoding.
/// </summary>
/// <param name="encoding">
/// The detected encoding.
/// </param>
/// <param name="fileName">
/// The file name.
/// </param>
/// <param name="windowSize">
/// The number of characters to use for testing.
/// </param>
/// <returns>
/// true if the file is text.
/// </returns>
public static bool IsText(out Encoding encoding, string fileName, int windowSize)
{
    using (var fileStream = File.OpenRead(fileName))
    {
        var rawData = new byte[windowSize];
        var text = new char[windowSize];
        var isText = true;

        // Read raw bytes
        var rawLength = fileStream.Read(rawData, 0, rawData.Length);
        fileStream.Seek(0, SeekOrigin.Begin);

        // Detect encoding from the BOM (adapted from Rick Strahl's blog)
        // http://www.west-wind.com/weblog/posts/2007/Nov/28/Detecting-Text-Encoding-for-StreamReader
        if (rawData[0] == 0xef && rawData[1] == 0xbb && rawData[2] == 0xbf)
        {
            encoding = Encoding.UTF8;
        }
        else if (rawData[0] == 0xff && rawData[1] == 0xfe && rawData[2] == 0 && rawData[3] == 0)
        {
            // FF FE 00 00 is the little-endian UTF-32 BOM
            encoding = Encoding.UTF32;
        }
        else if (rawData[0] == 0xff && rawData[1] == 0xfe)
        {
            // FF FE is the little-endian UTF-16 BOM
            encoding = Encoding.Unicode;
        }
        else if (rawData[0] == 0xfe && rawData[1] == 0xff)
        {
            // FE FF is the big-endian UTF-16 BOM
            encoding = Encoding.BigEndianUnicode;
        }
        else if (rawData[0] == 0 && rawData[1] == 0 && rawData[2] == 0xfe && rawData[3] == 0xff)
        {
            // 00 00 FE FF is the big-endian UTF-32 BOM
            encoding = new UTF32Encoding(bigEndian: true, byteOrderMark: true);
        }
        else if (rawData[0] == 0x2b && rawData[1] == 0x2f && rawData[2] == 0x76)
        {
            encoding = Encoding.UTF7;
        }
        else
        {
            encoding = Encoding.Default;
        }

        // Read the file as text
        using (var streamReader = new StreamReader(fileStream))
        {
            streamReader.Read(text, 0, text.Length);
        }

        // Round-trip: write the text back out with the detected encoding
        using (var memoryStream = new MemoryStream())
        using (var streamWriter = new StreamWriter(memoryStream, encoding))
        {
            streamWriter.Write(text);
            streamWriter.Flush();

            // Get the buffer from the memory stream for comparison
            var memoryBuffer = memoryStream.GetBuffer();

            // Compare only the bytes actually read
            for (var i = 0; i < rawLength && isText; i++)
            {
                isText = rawData[i] == memoryBuffer[i];
            }
        }

        return isText;
    }
}
7

Great question! I was surprised myself that .NET does not provide an easy solution for this.

The following code worked for me to distinguish between images (png, jpg etc) and text files.

I just checked for consecutive nulls (0x00) in the first 512 bytes, as per suggestions by Ron Warholic and Adam Bruss:

if (File.Exists(path))
{
    // Is it binary? Check for consecutive nulls..
    byte[] content = File.ReadAllBytes(path);
    for (int i = 1; i < 512 && i < content.Length; i++) {
        if (content[i] == 0x00 && content[i-1] == 0x00) {
            return Convert.ToBase64String(content);
        }
    }
    // No? return text
    return File.ReadAllText(path);
}

Obviously this is a quick-and-dirty approach, but it can easily be expanded by breaking the file into ten chunks of 512 bytes each and checking each of them for consecutive nulls (personally, I would deduce it's a binary file if 2 or 3 of them match; nulls are rare in text files).

That should provide a pretty good solution for what you are after.
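
A sketch of that chunked variant (the chunk count, chunk size, and "2 of them" threshold follow the paragraph above but are otherwise arbitrary):

// Requires: using System.IO;
static bool LooksBinary(string path)
{
    const int chunkSize = 512, chunkCount = 10;
    int chunksWithNulls = 0;
    using (var fs = File.OpenRead(path))
    {
        byte[] chunk = new byte[chunkSize];
        for (int c = 0; c < chunkCount; c++)
        {
            int read = fs.Read(chunk, 0, chunk.Length);
            if (read == 0) break;
            for (int i = 1; i < read; i++)
            {
                if (chunk[i] == 0x00 && chunk[i - 1] == 0x00)
                {
                    chunksWithNulls++;
                    break;
                }
            }
        }
    }
    // Two or more chunks containing consecutive nulls: call it binary.
    return chunksWithNulls >= 2;
}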

Steven de Salas
4

Quick and dirty is to use the file extension and look for common text extensions such as .txt. For this, you can use the Path.GetExtension call. Anything else would not really be classed as "quick", though it may well be dirty.
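
A sketch of that approach (the extension list is an assumption; extend it for your domain):

// Requires: using System; using System.Collections.Generic; using System.IO;
static readonly HashSet<string> TextExtensions = new HashSet<string>(
    StringComparer.OrdinalIgnoreCase)
{
    ".txt", ".csv", ".log", ".xml", ".html", ".json"
};

static bool LooksLikeTextByExtension(string path)
{
    return TextExtensions.Contains(Path.GetExtension(path));
}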

Jeff Yates
  • 4
    Sometimes guys like me can change the extension of a binary file to .txt – Kirtan May 26 '09 at 14:11
  • Obviously, but he asked for cheap and dirty - there's no foolproof way but to ask a person to read it. – Jeff Yates May 26 '09 at 14:19
  • that's good; unfortunately I'm not dealing with common extensions. I'm writing some kind of list of all files and need to categorize them as bin or text. Most people do it by hand, but as I am lazy I prefer to write code. – Pablo Retyk May 26 '09 at 14:31
  • Many people export "excel files" with an .xls extension which are actually csv-files or html-files. – Tim Schmelter Nov 15 '13 at 15:13
  • @TimSchmelter: I said quick and dirty, not foolproof and 100% effective. :) – Jeff Yates Nov 15 '13 at 16:31
  • @JeffYates: I know, but most people (like Kirtan) think the extension approach is only a problem if someone tries to upload an exe as txt or so. Many files are not what they're supposed to be, even under normal circumstances. – Tim Schmelter Nov 15 '13 at 16:35
2

A really, really, really dirty way would be to build a regex that accepts only standard text, punctuation, symbols, and whitespace characters, load a portion of the file into a text stream, then run it against the regex. Depending on what qualifies as a pure text file in your problem domain, no successful match would indicate a binary file.

To account for Unicode, make sure to mark the encoding on your stream as such.

This is really suboptimal, but you said quick and dirty.
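
A sketch of that regex idea (the character class, the UTF-8 assumption, and the 1 KB sample size are mine, not the answer's):

// Requires: using System.IO; using System.Text; using System.Text.RegularExpressions;
static bool LooksLikeText(string path)
{
    // Load a portion of the file as text, marking the encoding as suggested.
    char[] buffer = new char[1024];
    int read;
    using (var reader = new StreamReader(path, Encoding.UTF8))
        read = reader.Read(buffer, 0, buffer.Length);

    // Accept only letters/digits, whitespace, punctuation, and symbols.
    var allowed = new Regex(@"\A[\w\s\p{P}\p{S}]*\z");
    return allowed.IsMatch(new string(buffer, 0, read));
}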

Chad Ruppert
1

Another way is to detect the file's charset using UDE. If a charset is detected successfully, you can be sure the file is text; otherwise, it's binary, since binary data has no charset.

Of course you can use a charset-detection library other than UDE. If the library is good enough, this approach could achieve 100% correctness.
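
A sketch using the Ude NuGet package (this follows the usage shown in its README; treat the API details as an assumption and check the package documentation):

// Requires the Ude package: using System.IO; using Ude;
static bool IsTextFile(string path)
{
    using (var fs = File.OpenRead(path))
    {
        var detector = new CharsetDetector();
        detector.Feed(fs);
        detector.DataEnd();
        // Charset is null when no text encoding could be detected.
        return detector.Charset != null;
    }
}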

Tyler Liu
1

How about another way: determine the length of the byte array representing the file's contents and compare it with the length of the string you get after converting that byte array to text.

If the lengths are the same, there are no "non-readable" symbols in the file, so it's text (I'm 80% sure).
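
A sketch of that idea (assuming UTF-8 decoding, where invalid byte sequences collapse into replacement characters and change the length; note the caveats in the comments below):

// Requires: using System.IO; using System.Text;
static bool LooksLikeText(string path)
{
    byte[] bytes = File.ReadAllBytes(path);
    // Invalid sequences decode to U+FFFD, altering the length.
    string text = Encoding.UTF8.GetString(bytes);
    // Caveat: valid multi-byte characters also change the length,
    // so this effectively tests for pure single-byte (ASCII) text.
    return text.Length == bytes.Length;
}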

shytikov
  • That of course depends on the encoding used. –  Mar 05 '12 at 21:33
  • 1
    And converting a random-length file into a byte array and then converting it to a string could easily use extreme amounts of resources. Just think about a 2 GB log file (which is definitely a text file)... if you want to compare it to its unconverted version, you have to reserve more than 4 GB of memory, then compare it page by page... that is not even quick. – mg30rg Aug 08 '13 at 11:50
1

http://codesnipers.com/?q=node/68 describes how to detect UTF-16 vs. UTF-8 using a Byte Order Mark (which may appear in your file). It also suggests looping through some bytes to see if they conform to the UTF-8 multi-byte sequence pattern (below) to determine if your file is a text file.

  • 0xxxxxxx ASCII < 0x80 (128)
  • 110xxxxx 10xxxxxx 2-byte >= 0x80
  • 1110xxxx 10xxxxxx 10xxxxxx 3-byte >= 0x800
  • 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 4-byte >= 0x10000
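
A sketch of that multi-byte pattern check (reading the whole file is my simplification; a real scan would sample a window):

// Requires: using System.IO;
static bool IsValidUtf8(byte[] data)
{
    int i = 0;
    while (i < data.Length)
    {
        byte b = data[i];
        int continuationBytes;
        if (b < 0x80) continuationBytes = 0;                    // 0xxxxxxx
        else if ((b & 0xE0) == 0xC0) continuationBytes = 1;     // 110xxxxx
        else if ((b & 0xF0) == 0xE0) continuationBytes = 2;     // 1110xxxx
        else if ((b & 0xF8) == 0xF0) continuationBytes = 3;     // 11110xxx
        else return false;                                      // invalid lead byte
        if (i + continuationBytes >= data.Length) return false; // truncated sequence
        for (int j = 1; j <= continuationBytes; j++)
            if ((data[i + j] & 0xC0) != 0x80)                   // must be 10xxxxxx
                return false;
        i += continuationBytes + 1;
    }
    return true;
}

// Usage: bool isText = IsValidUtf8(File.ReadAllBytes(path));
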
foson
  • This works if the file is guaranteed to be UTF8/16, or binary. But what if it is neither? What if it is a Text file, encoded in neither ASCII nor UTF-8/16. What if it is encoded in the Big5 code page? Or ISO-8859-1? These have no BOM. So... how to cover that case as well? – Cheeso May 26 '09 at 14:56
  • If the file is (US-)ASCII it is in fact UTF-8, because characters with a 7-bit character code are translated to themselves in UTF-8, but if it is made in some localized ANSI code page, it will still be recognized as binary by the above method. – mg30rg Aug 08 '13 at 11:43