I need to read through many files and search for specific text in them. I want to open only text files, i.e., no image, movie, etc. files. I am looking for a way to identify non-text files. Since I will be using a FileStream and doing a byte search, it seems to me I can stop reading and close a file as soon as a byte whose decimal value is greater than 127 is encountered. Does this seem like a good approach?
- Can you filter files by extension? – Alex Jan 20 '11 at 18:51
- Are the extensions known? .txt, .doc, etc.? – WernerCD Jan 20 '11 at 18:52
- Your user can tell, easily, it isn't a text file when it looks like Chinese. Provide the message box with Yes/No. – Hans Passant Jan 20 '11 at 23:32
3 Answers
There's no foolproof answer for this. If you know that any text files will only ever be ASCII characters (and encoded in ASCII, UTF-8 or something similar) then yes, that will work... although it may not catch all non-text files.
However:
- It will fail for any text file that contains non-ASCII characters.
- It could still fail for a file which is a valid binary file for some format, but happens not to contain any byte values above 127.
Does the sequence of bytes { 34, 87, 23, 10 } represent text or binary data? There's simply no way of knowing for sure. Anything you do will be heuristic.
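To make the heuristic concrete, here is a minimal sketch of the byte scan described in the question. The names TextFileHeuristic and LooksLikeAscii are made up for illustration, and the 4096-byte buffer size is an arbitrary assumption. It only answers "did I see a byte above 127?", so it shares every limitation listed above.

```csharp
using System;
using System.IO;

// Hypothetical helper; the class and method names are illustrative only.
public static class TextFileHeuristic
{
    // Opens the file read-only and scans it with the stream overload below.
    public static bool LooksLikeAscii(string path)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            return LooksLikeAscii(stream);
        }
    }

    // Returns false as soon as a byte above 127 is seen, true otherwise.
    // An empty stream returns true because nothing disqualifies it.
    public static bool LooksLikeAscii(Stream stream)
    {
        var buffer = new byte[4096]; // buffer size is an arbitrary choice
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (buffer[i] > 127)
                {
                    return false; // non-ASCII byte: stop reading early
                }
            }
        }
        return true;
    }
}
```

Note that the sequence { 34, 87, 23, 10 } passes this check even though byte 23 is a control character, which is exactly the ambiguity described above: passing the scan does not prove the file is text.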

- I want to disqualify a file if it is not plain ASCII text. The files in the folder could be anything and I have no advance knowledge of what type of file I will be opening. Extensions are not reliable: a movie file could be renamed with a .txt extension. So if a non-ASCII character is encountered, then it seems I should just reject the file and move on to the next one. What's wrong with that? – Bill Seacham Jan 20 '11 at 19:00
- @Bill: Because you could still have a file which is actually binary data in some form but contains no bytes greater than 127... and I would personally hesitate to disqualify non-ASCII. Of course, I don't know your situation. If this is just for a tool where you can verify the results, it makes sense as a useful heuristic - but you need to be *very* aware of its limitations. – Jon Skeet Jan 20 '11 at 19:04
Not sure if this is a home grown application and you just want a quick and dirty solution.
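If so, you could make use of Path.GetExtension:
    using System;
    using System.IO;

    string p = @"C:\Myfile.txt";
    // GetExtension returns the extension including the leading dot, e.g. ".txt"
    string e = Path.GetExtension(p);
    // Compare case-insensitively so that ".TXT" also matches
    if (string.Equals(e, ".txt", StringComparison.OrdinalIgnoreCase))
    {
        // do stuff; process the file
    }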
Keep in mind that an extension does not dictate data type. This is only valuable if you can guarantee the extension type is representative of the data.

Can you just check whether the file extension is ".txt", ".csv", etc.?
The thing is you're going to have to know the encoding: How can I detect the encoding/codepage of a text file
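As a rough illustration of one partial technique, a file that starts with a byte order mark (BOM) announces its Unicode encoding. The sketch below checks only for the common BOMs; the names BomSniffer and DetectFromBom are hypothetical, and a null result means "no BOM found", not "not text", since many text files carry no BOM at all.

```csharp
using System;
using System.Text;

// Hypothetical helper; the class and method names are illustrative only.
public static class BomSniffer
{
    // Inspects the first bytes of a file and returns the encoding implied
    // by a byte order mark, or null when no known BOM is present.
    public static Encoding DetectFromBom(byte[] head)
    {
        if (head == null) return null;

        // UTF-32 LE (FF FE 00 00) must be tested before UTF-16 LE (FF FE),
        // because the shorter mark is a prefix of the longer one.
        if (head.Length >= 4 && head[0] == 0xFF && head[1] == 0xFE
            && head[2] == 0x00 && head[3] == 0x00)
            return Encoding.UTF32;

        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return Encoding.UTF8;

        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16 little endian

        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16 big endian

        return null;
    }
}
```

A caller would read the first four bytes of the file and pass them in; anything beyond BOM sniffing remains a heuristic.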
- No, the extension is no guarantee. Encoding is not relevant when searching with a FileStream. – Bill Seacham Jan 20 '11 at 19:13