4

How can I detect if a file is binary or plain text?

Basically, my .NET app processes batch files and extracts data from them, but I don't want it to process binary files.

As a solution, I'm thinking about analysing the first X bytes of the file: if there are more unprintable characters than printable ones, the file is treated as binary.

Is this the right way to do it? Is there any better implementation for this task?

Federico klez Culloca
dr. evil
  • 1
    Your method is pretty much how I would do it. I'd be scanning for lots of \n's, but the same idea. – Michael Dorgan May 27 '10 at 17:20
  • 1
    Look at http://stackoverflow.com/questions/567757/how-do-i-distinguish-between-binary-and-text-files or at http://stackoverflow.com/questions/277521/how-to-identify-the-file-content-is-in-ascii-or-binary - these are the same questions, except not specialized for .NET, I think most of what you want to know is answered there already. – schnaader May 27 '10 at 17:22
  • What kind of processing are you doing? – Lasse V. Karlsen May 27 '10 at 17:22
  • @Lasse It extracts pieces of text (I've got 3-5 different patterns), so if I hit a binary file, that means a lot of processing power spent trying to match stuff in binary data. – dr. evil May 27 '10 at 17:30
  • @schnaader I searched for it I think because of my ignore list! couldn't find any of those – dr. evil May 27 '10 at 17:30

4 Answers

6

What exactly do you mean by binary? Is the 'Art of War' written in Chinese binary to you? What about a Japanese-English dictionary?

There is no 100% reliable way.

You would need to use some kind of heuristic.

Some options might be to look at:

If the above (especially file signatures and extensions) don't help, then try to guess based on the presence/absence of certain bytes (like you are doing).

Note: It is better to check extensions/signatures first, as you would only need to read a few bytes or the file metadata, which is much cheaper than actually reading the whole file.
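
A signature check of the kind mentioned above could be sketched roughly as follows. This is a minimal illustration in Python (the same logic ports directly to .NET), not a complete detector: the table `KNOWN_SIGNATURES` and the function names are made up for this example, and it lists only a handful of well-known magic numbers.

```python
# Illustrative table only; a real tool would need a far larger signature database.
KNOWN_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"MZ": "Windows executable",
}


def match_signature(head: bytes):
    """Return a description if `head` starts with a known signature, else None."""
    for magic, description in KNOWN_SIGNATURES.items():
        if head.startswith(magic):
            return description
    return None


def has_known_binary_signature(path: str, probe_size: int = 16) -> bool:
    """Read only the first few bytes of the file and test them against the table."""
    with open(path, "rb") as f:
        return match_signature(f.read(probe_size)) is not None
```

Short signatures such as `MZ` can also occur at the start of an ordinary text file, so a match is a hint rather than proof. A file with no match still has to go through the byte-based guess.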

  • 2
    This is the reason I asked the question :) – dr. evil May 27 '10 at 18:36
  • Metadata reading is too much, though: you need a signature database etc., and for my task that is totally over-engineering it. – dr. evil May 27 '10 at 18:37
  • @dr. evil. A file extension check would not be reasonable? I consider that file metadata. Anyway, I guess you have enough info to get on with your work :-) –  May 27 '10 at 19:18
  • As you said I think I've got enough info to start it, shame there is no easy to use .NET library for this purpose. – dr. evil May 28 '10 at 09:33
5

Unix file command does this in a clever way. Of course, it does a lot more, but you can check the algorithm here and then build something specialized.


UPDATE: The link above seems to be broken. Try this.
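
For a rough idea of what a content-based check can look like, here is a minimal sketch in Python. It is not a port of `file`: it only looks for a byte order mark and, failing that, treats any NUL byte as a sign of binary data. The NUL rule, the `BOMS` table and the function names are assumptions made for this example.

```python
import codecs

# UTF-32 BOMs are listed first because the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_bom(head: bytes):
    """Return the codec name suggested by a leading byte order mark, or None."""
    for bom, encoding in BOMS:
        if head.startswith(bom):
            return encoding
    return None


def looks_binary(head: bytes) -> bool:
    """Assumed rule of thumb: a BOM means text; otherwise a NUL byte means binary."""
    if detect_bom(head) is not None:
        return False
    return b"\x00" in head
```

The BOM test has to come first because UTF-16 and UTF-32 text is full of NUL bytes and would otherwise be misclassified.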

Bruno Brant
  • 1
    Is this really applicable to a .Net app running on windows environment? –  May 27 '10 at 17:47
  • 1
    @Moron: yes, because `file` doesn't use OS-provided information to determine file type. It's just looking at BOM, magic numbers, content heuristics, etc as mentioned variously in the other answers. – Derrick Turk May 27 '10 at 18:10
  • @Derrick: What I meant was, does it detect files commonly found on Windows machines, say found on Windows Vista/ Windows 7? In any case, just pointing someone to the source code of 'file' is not really helpful. –  May 27 '10 at 18:16
  • @Moron: Sorry, but to provide a complete implementation of such algorithm would take a lot of time. `file` **is** system agnostic in its algorithms, although the source file is not. I think that anyone who can read C# can understand a bit of C code (since they are similar) so I thought you'd have no trouble finding the part of the source that was relevant to you. `file` is very reliable, and can tell you what you want (binary vs. plain-text) most accurately. – Bruno Brant May 28 '10 at 18:54
  • True, an off the shelf solution will be better than implementing on your own. Pointing to a unix C implementation of file does not help in that regard, though. If you noticed, I don't disagree strongly enough to give your answer a -1 :-) –  May 28 '10 at 19:53
1

I think the best way of doing this is to take at most the first X bytes of the file (X could be 256, 512, etc.) and count the bytes that would not appear in a plain ASCII file (permitted ASCII codes: 10, 13, 32-126). If you know for sure that the script is written in English, then no character can fall outside that set. If you are not sure about the language, then you may permit at most Y characters outside the set (if X is 512, I would choose Y to be 8 or 10).

If this is not good enough, you may add more constraints. For example, depending on the syntax of the files, certain keywords should be present (e.g. for your batch files, there should be some echo, for, if, goto, call, exit, etc.).
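
The counting rule above can be sketched like this, in Python for brevity (it ports directly to .NET). The names `ALLOWED`, `sample_size` and `max_disallowed` stand in for the permitted set, X and Y, and are made up for this example. The keyword list is the one given above.

```python
import re

# Permitted byte values as listed above: LF (10), CR (13) and printable ASCII (32-126).
ALLOWED = frozenset([10, 13]) | frozenset(range(32, 127))

# Keywords expected in a batch file, matched as whole words and case-insensitively.
BATCH_KEYWORDS = re.compile(rb"\b(echo|for|if|goto|call|exit)\b", re.IGNORECASE)


def count_disallowed(data: bytes) -> int:
    """Count the bytes that fall outside the permitted set."""
    return sum(1 for b in data if b not in ALLOWED)


def looks_like_text(data: bytes, max_disallowed: int = 0) -> bool:
    """Treat the sample as text if at most `max_disallowed` (Y) bytes are outside the set."""
    return count_disallowed(data) <= max_disallowed


def has_batch_keyword(data: bytes) -> bool:
    """Optional extra constraint: at least one batch keyword appears in the sample."""
    return BATCH_KEYWORDS.search(data) is not None


def file_looks_like_text(path: str, sample_size: int = 512, max_disallowed: int = 8) -> bool:
    """Apply the rule to at most the first `sample_size` (X) bytes of a file."""
    with open(path, "rb") as f:
        return looks_like_text(f.read(sample_size), max_disallowed)
```

Note that the tab character (9) is not in the permitted set as stated, so tab-indented files use up part of the Y budget. Adding 9 to `ALLOWED` would be a one-line change if that matters for your files.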

botismarius
0

You could run a regex over the first X bytes and accept the file only if every byte falls in a suitable character class. But that might presuppose that you know the encoding.
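
A minimal sketch of the regex idea, in Python for brevity. The character class here (CR, LF and printable ASCII) is an assumption chosen for this example, and it shows the encoding caveat directly: the pattern only suits single-byte ASCII text.

```python
import re

# Assumed character class: CR, LF and printable ASCII (0x20-0x7E).
TEXT_ONLY = re.compile(rb"[\r\n\x20-\x7e]*")


def all_bytes_textual(data: bytes) -> bool:
    """True only if every byte of the sample falls inside the character class."""
    return TEXT_ONLY.fullmatch(data) is not None
```

With this class, `all_bytes_textual("hi".encode("utf-16-le"))` is false because UTF-16 pads each ASCII character with a NUL byte, so a file in another encoding needs a different pattern or has to be decoded first.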

Brent Arias