
I need to read through many files and search for specific text in them. I want to open only text files, i.e., no images, movies, etc. I am looking for a way to identify non-text files. Since I will be using a FileStream and doing a byte search, it seems to me I can stop reading and close a file as soon as I encounter a byte whose decimal value is greater than 127 (i.e., outside the 7-bit ASCII range). Does this seem like a good approach?

3 Answers


There's no foolproof answer for this. If you know that any text files will only ever be ASCII characters (and encoded in ASCII, UTF-8 or something similar) then yes, that will work... although it may not catch all non-text files.

However:

  • It will fail for any text file that contains non-ASCII text.
  • It could still fail for a file that is valid binary data in some format but happens not to contain any byte values above 127.

Does the sequence of bytes { 34, 87, 23, 10 } represent text or binary data? There's simply no way of knowing for sure. Anything you do will be heuristic.
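As a rough illustration of the heuristic under discussion, the byte scan could be sketched as below. The class and method names (`AsciiHeuristic`, `LooksLikeAscii`) are made up for this example, and the 4096-byte buffer size is an arbitrary assumption. A `true` result only means that no byte above 127 was seen; it does not prove the content is text.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of any library.
public static class AsciiHeuristic
{
    // Returns false as soon as a byte outside 7-bit ASCII (value > 127) is read.
    // Returns true if the stream ends without such a byte being seen.
    public static bool LooksLikeAscii(Stream stream)
    {
        var buffer = new byte[4096]; // arbitrary chunk size
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (buffer[i] > 127)
                {
                    return false; // stop early, as the question proposes
                }
            }
        }
        return true;
    }

    // Convenience wrapper that opens a file read-only and scans it.
    public static bool LooksLikeAscii(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            return LooksLikeAscii(fs);
        }
    }
}
```

Passing `new MemoryStream(new byte[] { 34, 87, 23, 10 })` to this sketch returns `true`, even though byte 23 is a control character that rarely appears in real text, which is exactly the kind of ambiguity described above.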

Jon Skeet
  • I want to disqualify a file if it is not plain ASCII text. The files in the folder could be anything, and I have no advance knowledge of what type of file I will be opening. Extensions are not reliable - a movie file could be renamed with a .txt extension. So if a non-ASCII character is encountered, then it seems I should just reject the file and move on to the next one. What's wrong with that? – Bill Seacham Jan 20 '11 at 19:00
  • @Bill: Because you could still have a file which is actually binary data in some form but contains no bytes greater than 127... and I would personally hesitate to disqualify non-ASCII. Of course, I don't know your situation. If this is just for a tool where you can verify the results, it makes sense as a useful heuristic - but you need to be *very* aware of its limitations. – Jon Skeet Jan 20 '11 at 19:04

I'm not sure whether this is a homegrown application where you just want a quick-and-dirty solution.

If so, you could make use of Path.GetExtension:

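    using System;
    using System.IO;

    string p = @"C:\Myfile.txt";
    string e = Path.GetExtension(p); // includes the leading dot, e.g. ".txt"
    // Compare case-insensitively so that ".TXT" also matches.
    if (string.Equals(e, ".txt", StringComparison.OrdinalIgnoreCase))
    {
        // do stuff; process the file
    }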

Keep in mind that an extension does not dictate data type. This is only valuable if you can guarantee the extension type is representative of the data.
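If more than one extension should count as text, the same check could be extended to an allow-list. The sketch below is only an illustration: the class name `ExtensionFilter` and the particular extensions in the set are assumptions, and the caveat above still applies, since a renamed movie file would pass.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration; not part of any library.
public static class ExtensionFilter
{
    // Assumed allow-list; adjust to the extensions you actually trust.
    private static readonly HashSet<string> TextExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".txt", ".csv", ".log" };

    public static bool HasTextExtension(string path)
    {
        // Path.GetExtension returns the extension with its leading dot,
        // or an empty string when the path has no extension.
        string ext = Path.GetExtension(path);
        return TextExtensions.Contains(ext);
    }
}
```

A call such as `ExtensionFilter.HasTextExtension("Myfile.TXT")` matches because the set uses a case-insensitive comparer.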

Aaron McIver

Can you just check whether the file extension is ".txt", ".csv", etc.?

The thing is, you're going to have to know the encoding. See the related question "How can I detect the encoding/codepage of a text file".
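One partial approach to the encoding problem is to let `StreamReader` look for a byte order mark (BOM). The sketch below assumes a hypothetical helper named `BomSniffer.DetectFromBom`; it can only recognize files that actually start with a BOM, and anything else falls back to whatever encoding the caller supplies, so it is a hint rather than a reliable detector.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not part of any library.
public static class BomSniffer
{
    // Returns the encoding StreamReader infers from a byte order mark,
    // or the supplied fallback when no BOM is present.
    // Note: disposing the reader also closes the underlying stream.
    public static Encoding DetectFromBom(Stream stream, Encoding fallback)
    {
        using (var reader = new StreamReader(stream, fallback, detectEncodingFromByteOrderMarks: true))
        {
            reader.Peek(); // forces the reader to inspect the first bytes
            return reader.CurrentEncoding;
        }
    }
}
```

For example, a stream beginning with the bytes `0xFF 0xFE` followed by UTF-16 data would be reported as `Encoding.Unicode`, while a file without a BOM is reported as the fallback regardless of its real encoding.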

capdragon