I am working on a project that reads all files from a local HDD; I specify the extensions I would like to include in the search.

All of the chosen file extensions are based on the fact that the file has text content.

So for my own use I can specify which extensions to take into account, such as .cs, .html, .htm, .css, .js, etc.

What if I want to add a feature that lets a generic user select extensions, choosing from all available Windows file extensions, but with the list restricted to those file types on his system that are textual? For instance, we know that .exe, .mp3, .mpg and .avi are not, but he could have some other types of files (extensions) that we did not take into account.

Is there a way to decide that based on a system file property? If not, what would be the way to filter only text-content files?

Jbob Johan
  • There is no good way to do that... So hackish "try read and it is text if you can understand content" is "the best". You may consider searching for "detect file type without extension" (or something similar) for previous discussions on topic. – Alexei Levenkov Nov 14 '15 at 19:05
  • Extensions only provide a weak indication of a file's contents. I bet there are applications out there that also use the `.cs` extension without these files containing text. – C.Evenhuis Nov 14 '15 at 19:07
  • I don't think there is one, at least not a generic one. For instance: the extension docx from a word file is not text as such, docx-files are zipped XML files. But as a user, you would probably expect word files to be considered text. – Dirk Trilsbeek Nov 14 '15 at 19:08
  • @DirkTrilsbeek `docx` and `doc` are parsable through dedicated .NET classes, so both should be considered textual, because you have written text into them and you can parse it too – Jbob Johan Nov 14 '15 at 19:16
  • @JbobJohan that is exactly what I mean. There is no generic way, based on the file itself, to determine if a file contains textual content. Because in my example, docx contains text content, but from a technical point it isn't text. Of course you can read doc/docx, but what about lots of other formats that are built similar but are just unknown to you? You can't interpret what you haven't heard of yet. – Dirk Trilsbeek Nov 14 '15 at 19:19

2 Answers

One mechanism for Windows machines is to look up the Content Type in the Windows Registry associated with the file extension. (I do not know of a way to do this without a direct registry lookup.)

Within the registry, file extensions that are text-based should generally have one or more of these characteristics:

  • A Content Type indicating a MIME primary type of text, e.g., text/plain or text/html
  • A Perceived Type of text
  • A PersistentHandler subkey whose default value is the GUID {5e941d80-bf96-11cd-b579-08002b30bfeb}, which is assigned to the plain text persistent handler.

The following method will return all system extensions associated with these characteristics:

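// Requires: using System; using System.Collections.Generic; using System.Linq; using Microsoft.Win32;
static IEnumerable<string> GetTextExtensions()
{
    // Ordinal comparison: registry values are identifiers, not culture-sensitive text.
    var defaultcomp = StringComparison.OrdinalIgnoreCase;
    var root = Registry.ClassesRoot;
    // Extension keys under HKEY_CLASSES_ROOT start with a dot.
    foreach (var s in root.GetSubKeyNames()
        .Where(a => a.StartsWith(".", StringComparison.Ordinal)))
    {
        using (RegistryKey subkey = root.OpenSubKey(s))
        {
            // OpenSubKey returns null if the key no longer exists.
            if (subkey == null)
                continue;
            // MIME primary type "text", e.g. text/plain
            if (subkey.GetValue("Content Type")?.ToString().StartsWith("text/", defaultcomp) == true)
                yield return s;
            // Perceived type "text"
            else if (subkey.GetValue("PerceivedType")?.ToString().Equals("text", defaultcomp) == true)
                yield return s;
            else
            {
                // Plain text persistent handler registered for this extension
                using (var ph = subkey.OpenSubKey("PersistentHandler"))
                {
                    if (ph?.GetValue("")?.ToString().Equals("{5e941d80-bf96-11cd-b579-08002b30bfeb}", defaultcomp) == true)
                        yield return s;
                }
            }
        }
    }
}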

The output depends on the workstation configuration, but on my current machine it returns:

.a, .AddIn, .ans, .asc, .asm, .asmx, .aspx, .asx, .bas, .bat, .bcp, .c, .cc, .cd, .cls, .cmd, ...

While this depends on application installers correctly mapping file extensions, it appears to identify most of the major text file types.
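As a rough sketch of how the resulting list might be applied to the original problem, the helper below keeps only those file paths whose extension appears in a given set. The class and method names (`ExtensionFilter`, `WithExtensions`) are hypothetical and not part of the answer above; the only assumption about the input is that extensions include the leading dot, as the registry key names do.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ExtensionFilter
{
    // Hypothetical helper: keeps only the paths whose extension is in `extensions`.
    // Extensions are expected with a leading dot (".txt"), matching registry key names.
    public static IEnumerable<string> WithExtensions(
        IEnumerable<string> paths, IEnumerable<string> extensions)
    {
        // Windows extensions are case-insensitive, so compare ignoring case.
        var allowed = new HashSet<string>(extensions, StringComparer.OrdinalIgnoreCase);

        // Path.GetExtension returns the extension including the dot,
        // or an empty string when the file name has none.
        return paths.Where(p => allowed.Contains(Path.GetExtension(p)));
    }
}
```

It could then be combined with the method above along the lines of `ExtensionFilter.WithExtensions(Directory.EnumerateFiles(folder, "*", SearchOption.AllDirectories), GetTextExtensions())`, where `folder` is whatever directory the search starts from.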

drf
  • btw using reference to Microsoft.Win32 ..where did you call any method in `Win32`? – Jbob Johan Nov 14 '15 at 20:24
  • @JbobJohan The Registry classes are in the Microsoft.Win32 namespace. – drf Nov 14 '15 at 20:25
  • Sorry, I was using `RegistryKey` without remembering the need to reference it (: meaning I didn't realize till now that it's a Win32 feature rather than .NET standard – Jbob Johan Nov 14 '15 at 20:32
  • Since the Registry is Windows-specific, registry classes are in the Microsoft namespace instead of the more common System namespace. But these are .NET standard classes on Windows; the registry classes are exported in mscorlib.dll, along with other core .NET classes. – drf Nov 14 '15 at 20:49
  • I have marked this as the correct answer. I don't think there's any more to be added, and it should cover the task requirements as far as that can be done programmatically. Cheers – Jbob Johan Nov 14 '15 at 21:01

In general, there isn't any good and reliable way to do this.

You cannot decide by comparing file extensions: the extension is just a part of the filename and anyone can change it, so even file.exe can be a plain-text file.

C# - Check if File is Text Based
You could just check through the first 1000 (arbitrary number) characters and see if there are unprintable characters, or if they are all ASCII in a certain range.
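A minimal sketch of that heuristic, assuming it is applied to raw bytes rather than decoded characters: the sample is treated as text unless it contains a control byte other than tab, line feed or carriage return. The names `TextSniffer` and `LooksLikeText` are made up for illustration, and the rule is deliberately crude. Bytes of 128 and above are accepted so that UTF-8 and legacy code pages pass, while UTF-16 files will be rejected because they contain `\0` bytes.

```csharp
using System;
using System.IO;

static class TextSniffer
{
    // Hypothetical helper: true when the first `sampleSize` bytes contain no
    // control bytes other than tab (9), line feed (10) and carriage return (13).
    public static bool LooksLikeText(Stream stream, int sampleSize = 1000)
    {
        var buffer = new byte[sampleSize];
        int read = 0, n;
        // Stream.Read may return fewer bytes than requested, so loop until
        // the buffer is full or the end of the stream is reached.
        while (read < sampleSize &&
               (n = stream.Read(buffer, read, sampleSize - read)) > 0)
        {
            read += n;
        }

        for (int i = 0; i < read; i++)
        {
            byte b = buffer[i];
            bool allowedControl = b == 9 || b == 10 || b == 13;
            if (b < 32 && !allowedControl)
                return false; // e.g. NUL bytes, common in binary formats
        }
        return true; // note: an empty file also counts as text here
    }

    // Convenience overload that samples a file on disk.
    public static bool LooksLikeText(string path, int sampleSize = 1000)
    {
        using (var fs = File.OpenRead(path))
            return LooksLikeText(fs, sampleSize);
    }
}
```

Whether an empty file or a file with a form feed should count as text is a judgment call; the thresholds here are assumptions to adjust, not a standard.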

Martin Heralecký
  • I had not realized that _"there is no way"_ is an option in programming, especially for such a trivial task. – Jbob Johan Nov 14 '15 at 19:37
  • So I guess the workaround is to specify all the extensions you do know and add an option for the user to add more (and he would be able to ADD ANY!) – Jbob Johan Nov 14 '15 at 19:39
  • @LorenPechtel Actually, those are exactly the same unprintables as ASCII ones. Unless you mean UTF-16, where every other byte is `\0`. – Mr Lister Nov 19 '15 at 09:11