Detect if file contains text

Question

Possible Duplicate:
How can I determine if a file is binary or text in c#?
C# - Check if File is Text Based

To better understand multithreading and asynchronous tasks, I wrote a simple application in C# to count the total number of lines of code in a project (directory).

Currently, I open each file and count its lines. However, that includes all files (jpg, png, exe, etc.). Is there a way I can detect whether a file is a text file? Possibly by detecting ASCII encoding or something similar.

The question is indeed a duplicate but I think it's worth keeping it opened because the other 2 mentioned don't have excellent answers IMHO. — Serge Wautier, Nov 30 '11 at 07:37

score 2 · Accepted Answer · edited May 23 '17 at 12:03

Generally, you cannot reliably detect whether a file is a text file. The first problem is defining what "a text file" actually is. You already hinted at encodings, but those in particular cannot be detected reliably (see, for example, Notepad's struggle).

That said, you might be able to use heuristics to do your best: file extensions, of course, but also excluding well-known non-text file types such as EXE, DLL, ZIP, and image files by recognizing their signatures, perhaps combined with the approach used by browsers or Notepad.
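As a rough illustration of the signature idea, here is a minimal sketch that checks the first bytes of a file against a few widely documented magic numbers. The class and method names (`BinarySignatures`, `HasKnownBinarySignature`) are made up for this example, and the signature table is deliberately tiny; a real tool would need a much longer list.

```csharp
using System;
using System.IO;
using System.Linq;

static class BinarySignatures
{
    // A few well-known magic numbers found at the start of binary formats.
    // This list is illustrative, not exhaustive.
    private static readonly byte[][] Signatures =
    {
        new byte[] { 0x4D, 0x5A },                                     // "MZ": EXE and DLL
        new byte[] { 0x50, 0x4B, 0x03, 0x04 },                         // ZIP-based containers
        new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A }, // PNG
        new byte[] { 0xFF, 0xD8, 0xFF },                               // JPEG
        new byte[] { 0x47, 0x49, 0x46, 0x38 },                         // "GIF8"
    };

    // Returns true if the given header bytes start with a known signature.
    public static bool HasKnownBinarySignature(byte[] header)
    {
        return Signatures.Any(sig =>
            header.Length >= sig.Length &&
            header.Take(sig.Length).SequenceEqual(sig));
    }

    // Reads up to 8 bytes from the start of the file and checks them.
    public static bool HasKnownBinarySignature(string path)
    {
        var buffer = new byte[8];
        using (var stream = File.OpenRead(path))
        {
            int read = stream.Read(buffer, 0, buffer.Length);
            return HasKnownBinarySignature(buffer.Take(read).ToArray());
        }
    }
}
```

Note that this only rules files out. A file without a known signature is not necessarily text, and a text file that happens to begin with the letters "MZ" would be misclassified, so this works best combined with the other heuristics.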

Depending on your application, I guess it would be quite feasible to just let the user select the files to scan, perhaps with a default list of extensions to include (such as *.cs, *.txt, *.resx, *.xml, ...). If a file type or extension is not in the default list and was not added by the user, it is not counted. If the user adds an extension that does not belong to a "text file", the results are simply not useful.
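The extension list could look something like the following sketch. `ExtensionFilter` and its members are hypothetical names chosen for this example; the default extensions are the ones mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class ExtensionFilter
{
    // Case-insensitive so that "Program.CS" and "program.cs" are treated alike.
    private readonly HashSet<string> _extensions;

    public ExtensionFilter(IEnumerable<string> extensions)
    {
        _extensions = new HashSet<string>(extensions, StringComparer.OrdinalIgnoreCase);
    }

    // Default list; the user can extend it with Add.
    public static ExtensionFilter CreateDefault()
    {
        return new ExtensionFilter(new[] { ".cs", ".txt", ".resx", ".xml" });
    }

    // Adds a user-supplied extension, including the leading dot (e.g. ".sql").
    public void Add(string extension)
    {
        _extensions.Add(extension);
    }

    // A file is counted only if its extension is in the list.
    public bool ShouldCount(string path)
    {
        return _extensions.Contains(Path.GetExtension(path));
    }
}
```

The line counter would then call `ShouldCount` on each path returned by the directory scan and skip everything else.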

Weighing the effort against the fact that automatic detection will never be 100% exact for all possible files, this should be good enough.

Please note that the second link is broken. Here is a link to the archived page: http://web.archive.org/web/20131025185229/http://blogs.msdn.com/b/michkap/archive/2007/04/22/2239345.aspx — Kevin Vuilleumier, Sep 16 '14 at 13:13
@kevinvuilleumier thanks. I updated the link to Michael Kaplan's new blog. Luckily, there is no need for archive.org. — Christian.K, Sep 16 '14 at 18:30

score 1 · Answer 2 · answered Nov 30 '11 at 07:21

Testing for JPG, PNG, or EXE would be expensive if you really want to determine whether a file is binary or text. For JPG you have to run some JPEG algorithm, and the same goes for PNG. For EXE it would be different again.

One approach is to test for zero bytes in the file; people often use a threshold percentage of zero bytes above which a file is considered binary.
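A minimal sketch of the zero-byte test might look like this. The names (`NulByteHeuristic`, `LooksBinary`), the 4096-byte sample size, and the default threshold are all assumptions for illustration, not established values.

```csharp
using System;
using System.IO;

static class NulByteHeuristic
{
    // Returns true if the fraction of zero bytes among the first `count`
    // bytes of `sample` is strictly greater than `threshold`.
    public static bool LooksBinary(byte[] sample, int count, double threshold)
    {
        if (count == 0) return false; // an empty file is treated as text here

        int zeros = 0;
        for (int i = 0; i < count; i++)
        {
            if (sample[i] == 0) zeros++;
        }
        return (double)zeros / count > threshold;
    }

    // Samples the start of the file instead of reading all of it.
    // With the default threshold of 0.0, a single zero byte marks the file as binary.
    public static bool LooksBinary(string path, int sampleSize = 4096, double threshold = 0.0)
    {
        var buffer = new byte[sampleSize];
        using (var stream = File.OpenRead(path))
        {
            int read = stream.Read(buffer, 0, buffer.Length);
            return LooksBinary(buffer, read, threshold);
        }
    }
}
```

One caveat: text saved as UTF-16 contains a zero byte for every ASCII-range character, so this heuristic would flag such files as binary unless the encoding is handled separately.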

My suggestion would be to rely solely on the extension. It is very rare for a text file to be named with a .JPG, .PNG, or .EXE extension.

Please see this list of file extensions and pick out the text file extensions, such as .txt, .log, .html, .php, .asp, etc.

score 0 · Answer 3 · answered Nov 30 '11 at 07:33

FWIW, there is a library called MLang in Internet Explorer (in other words, in Windows) that features encoding detection. You can probably use it to simply detect whether a file is text or binary.

Here's an excellent C# wrapper:

http://www.codeproject.com/KB/recipes/DetectEncoding.aspx

That said, others' suggestion to use a file extension list (and maybe a signature list) should be enough.
