I know this has been asked before, but none of the solutions worked for me. I want to know whether a file uploaded to my server (via an .ashx handler) is of type .xlsx, .xls or .csv.

I tried using the magic numbers listed here, but if I, for example, change the extension of an .msi to .xls, the file is still recognized as .xls. The following code illustrates the problem:

private bool IsValidFileType(HttpPostedFile file)
{
    using (var memoryStream = new MemoryStream())
    {
        file.InputStream.CopyTo(memoryStream);
        byte[] buffer = memoryStream.ToArray();

        // Reject exe and dll ("MZ" header); guard against files shorter than 2 bytes
        if (buffer.Length >= 2 && buffer[0] == 0x4D && buffer[1] == 0x5A)
        {
            return false;
        }

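        // Check xlsx (zip signature "PK\x03\x04", or "PK\x05\x06" for an empty archive).
        // Four bytes are read, so the length must be at least 4, and the two
        // alternatives are parenthesized so the length check applies to both.
        if (buffer.Length >= 4 &&
            buffer[0] == 0x50 && buffer[1] == 0x4B &&
            (buffer[2] == 0x03 && buffer[3] == 0x04 ||
             buffer[2] == 0x05 && buffer[3] == 0x06))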
        {
            return true;
        }

        // Check xls (compound file signature); eight bytes are read, so require at least 8
        if (buffer.Length >= 8 &&
            buffer[0] == 0xD0 && buffer[1] == 0xCF &&
            buffer[2] == 0x11 && buffer[3] == 0xE0 &&
            buffer[4] == 0xA1 && buffer[5] == 0xB1 &&
            buffer[6] == 0x1A && buffer[7] == 0xE1)
        {
            return true;
        }

        return false;
    }
}

Then I tried using urlmon.dll, something like the following, but it still recognizes the file as .xls:

    [DllImport("urlmon.dll", CharSet = CharSet.Unicode, ExactSpelling = true, SetLastError = false)]
    static extern int FindMimeFromData(
        IntPtr pBC,
        [MarshalAs(UnmanagedType.LPWStr)] string pwzUrl,
        [MarshalAs(UnmanagedType.LPArray, ArraySubType=UnmanagedType.I1, SizeParamIndex=3)] byte[] pBuffer,
        int cbSize,
        [MarshalAs(UnmanagedType.LPWStr)] string pwzMimeProposed,
        int dwMimeFlags,
        out IntPtr ppwzMimeOut,
        int dwReserved);

    public static string GetMimeFromFile(string file)
    {
        if (!File.Exists(file))
            throw new FileNotFoundException(file + " not found");

        // Sniff at most the first 4096 bytes; clamp before casting so large files don't overflow int.
        int MaxContent = (int)Math.Min(new FileInfo(file).Length, 4096L);
        byte[] buf = new byte[MaxContent];
        using (FileStream fs = File.OpenRead(file))
        {
            fs.Read(buf, 0, MaxContent);
        }
        int result = FindMimeFromData(IntPtr.Zero, file, buf, MaxContent, null, 0, out IntPtr mimeout, 0);

        if (result != 0)
            throw Marshal.GetExceptionForHR(result);
        string mime = Marshal.PtrToStringUni(mimeout);
        Marshal.FreeCoTaskMem(mimeout);
        return mime;
    }

I was thinking that maybe I should try to open the uploaded file with a library such as ExcelDataReader, but I'm not sure if this is the best approach.

Any help would be appreciated.

  • Why can't you check the extension, then verify the appropriate magic bytes? (Although the latter isn't going to work for csv.) – Alex K. Aug 31 '18 at 14:00
  • Just for clarity: is it enough to detect that a file is _not_ what the extension says it should be, or do you actually need to detect what type of file it _is_, disregarding the extension completely? – Fildor Aug 31 '18 at 14:02
  • @AlexK. I'm currently doing that, but if I have a .msi file and change its extension to .xls, the result is still the same even if I check the appropriate magic bytes (it seems that the .msi header bytes are the same as the .xls ones) – Efrain Bastidas Berrios Aug 31 '18 at 14:09
  • @Fildor I cannot trust the file extension, since the user could upload a .msi with a .xls extension – Efrain Bastidas Berrios Aug 31 '18 at 14:11
  • Yes, I get that. So you only need "fraud detection". It's OK to cancel the operation and send back an error if actual file type != ext file type, right? – Fildor Aug 31 '18 at 14:14
  • @Fildor yes, I'm currently checking first whether the file has one of the extensions I want, and if it does, I use the methods in my question to validate the file (which aren't working as expected) – Efrain Bastidas Berrios Aug 31 '18 at 14:17
  • You cannot do this server side because by that time the file has already been uploaded. Please see [this](https://stackoverflow.com/questions/71944/how-do-i-validate-the-file-type-of-a-file-upload). – CodingYoshi Aug 31 '18 at 14:18
  • @CodingYoshi I don't think that's a good idea. One could easily avoid the client side validation with for example disabling JS, or by just not using the browser and creating the POST request himself (like with System.Net.WebClient). Anyway, what OP actually encountered is a problem of shared signatures among different file types. Different files can use the same file format, but have a different structure, and this client validation does not address that problem. – Mario Z Sep 01 '18 at 09:04
  • @marioz Yes but there is no other choice. At least I cannot think of another choice. It's true the user can bypass clientside validation and that has been mentioned in the linked answer. However, in this case the user can bypass the server validation too. Like all the user can do is minimize it. – CodingYoshi Sep 01 '18 at 12:07
  • @CodingYoshi exactly, all we can do is minimize it, and thus we should do that on server side to cover more ground. If you want, you can do that on both sides, but you cannot ignore the server side... it's much safer to ignore the client side... – Mario Z Sep 01 '18 at 18:37

3 Answers

How about opening the file with EPPlus or Interop and catching the exception if it isn't an Excel file?

FileInfo fileInfo = new FileInfo(filePath);
ExcelPackage package = null;
try
{
    package = new ExcelPackage(fileInfo);
}
catch (Exception)
{
    // The file could not be opened as an Excel package, so treat it as invalid.
}

Alternatively, there is a third-party library (not tested) that verifies the type of a file.

// Verbatim string so the backslash is not treated as an escape sequence.
FileInfo file = new FileInfo(@"C:\Hello.xls");
if (file.isExcel())
    Console.WriteLine("File is Excel");
Antoine V
  • Opening a file and catching exception will be extremely slow. – CodingYoshi Aug 31 '18 at 14:12
  • @CodingYoshi first, we're talking about an edge case in which the user provided an invalid file. So throwing and handling the exception is the right thing to do. Also, the code is explicit, it clearly shows an invalid state, and this is better for maintenance. Also, I think you have some prejudgments about the exceptions, it's not like you're throwing them in a loop or something... you're literally going to add few ms to the upload's execution, is that your definition of extreme? – Mario Z Sep 01 '18 at 08:33
  • @marioz I have no prejudgment about exceptions and understand it will not add a lot of time. It's opening the file which will be time consuming. And exception will add to that. If the user is adding multiple files and you are opening each for validation, it will definitely be slow. Go try it. My comment was a heads up and an FYI. – CodingYoshi Sep 01 '18 at 12:15
  • @CodingYoshi just opening a file's stream is not time consuming at all. The execution depends on what you're doing with it, for instance is the plan to read the whole file? If that is the case then yes, of course there will be some penalties. – Mario Z Sep 01 '18 at 18:45
  • @CodingYoshi But anyway, I still think that the best solution (in terms of security) that we currently have is to try reading the expected Excel file with some Excel library. Some of those libraries process lazily and thus will not read the complete spreadsheet, but will read enough to know whether it can be processed as an Excel file (has the expected structure). For instance, for XLSX files that would be OpenXML SDK or some library that is based on it. – Mario Z Sep 01 '18 at 18:47
  • I ended up using a fork of a fork of the third-party library you posted [link](https://github.com/clarkis117/Mime-Detective) and it's working pretty well – Efrain Bastidas Berrios Sep 02 '18 at 17:40

A file in itself is just data. The file extension allows your system to interpret that data accordingly. Without a file extension, there's no way of knowing with absolute certainty which file type you're looking at (unless you're working with a limited subset of file types).

You can, however, infer from the data which file extension it MIGHT be. The project that Thierry V referenced is out of date and not maintained.

You might instead want to look at a tool like TrID, which uses a continually growing library of file types. This tool will analyze a file and give a ranking of the most probable file types. Like I said before, it can only tell you with a limited amount of certainty which file type it might be.

Joshua VdM

"I tried using the magic numbers listed here, but if I for example change the extension of a .msi to .xls, the file will be recognized as .xls... The following code illustrates what I said:"

Yes, that is true. The only thing you can determine by checking a file's signature is the format on which the file is based. So for an ".xls" file you will detect that the file is of a compound binary format. However, as you noticed, this format is also used in ".msi" files, as well as in ".doc", ".ppt", etc.

Also, the same is true for your ".xlsx" detection, it is just checking that the file is of a zip format and the same signature will be found in ".zip", ".docx", ".ods", etc.
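To get past the shared zip signature, one option is to look inside the archive for parts that a spreadsheet package normally contains. The following is only a minimal sketch: the class and method names are hypothetical, and the part names `[Content_Types].xml` and `xl/workbook.xml` are an assumption based on what Excel typically writes, so a package produced by another tool could place the workbook part elsewhere. On .NET Framework it needs a reference to the System.IO.Compression assembly.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, not part of any library.
public static class XlsxSniffer
{
    // Returns true when the stream is a readable zip archive that contains
    // the parts a typical .xlsx package has (assumed names, see above).
    public static bool LooksLikeXlsx(Stream stream)
    {
        try
        {
            using (var archive = new ZipArchive(stream, ZipArchiveMode.Read, leaveOpen: true))
            {
                return archive.GetEntry("[Content_Types].xml") != null
                    && archive.GetEntry("xl/workbook.xml") != null;
            }
        }
        catch (InvalidDataException)
        {
            // The bytes are not a valid zip archive at all.
            return false;
        }
    }
}
```

A ".docx" or a plain ".zip" renamed to ".xlsx" would pass the signature check but fail this one, because it has no `xl/workbook.xml` entry. It still does not prove the workbook is well formed; only actually reading it does.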

So, you could check the file's signature and pass through files that are of those two formats, but what about ".csv"? Here you can have arbitrary byte values, because it's just plain text and has no signature.
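Since there is no signature to test, about the best you can do for ".csv" is a plausibility check. The sketch below is a heuristic under stated assumptions, not a definitive test: it assumes the upload is UTF-8 encoded (many real CSV files use a legacy code page and would be rejected) and that the first line contains the delimiter. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of any library.
public static class CsvSniffer
{
    // Heuristic: the bytes must decode as strict UTF-8 (assumption), contain
    // no control characters other than tab, CR and LF, and the first line
    // must contain the delimiter at least once.
    public static bool LooksLikeCsv(byte[] data, char delimiter = ',')
    {
        if (data == null || data.Length == 0) return false;

        string text;
        try
        {
            // throwOnInvalidBytes: true makes invalid UTF-8 raise an exception.
            var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                              throwOnInvalidBytes: true);
            text = strictUtf8.GetString(data);
        }
        catch (DecoderFallbackException)
        {
            return false;
        }

        foreach (char c in text)
        {
            // Binary formats almost always contain NULs or other control bytes.
            if (char.IsControl(c) && c != '\t' && c != '\r' && c != '\n') return false;
        }

        string firstLine = text.Split('\n')[0];
        return firstLine.IndexOf(delimiter) >= 0;
    }
}
```

This rejects renamed binaries such as an ".msi", whose compound file header is not valid UTF-8, but any text file with a comma in its first line will pass, so parsing the rows is still the real validation.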

Anyway, I think the real question is: what is your goal with those Excel files? Do you need to process them further?
If you do, then you should rely on the failure mechanism of whatever reads the file. Whichever library you pick to read the file will most likely throw an exception because of either an "unrecognized format" or an "unrecognized structure" of the file.

By "unrecognized structure" I mean that, for instance, an ".xls" file is expected to contain streams named "Workbook", "SummaryInformation", etc.
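If you want a cheap pre-check before handing the file to a reader, you can approximate that structure test. The sketch below is a crude heuristic, not a compound file parser: it verifies the signature and then scans the raw bytes for the name "Workbook" encoded as UTF-16LE, which is how directory entry names are stored in a compound file. The class and method names are hypothetical, the scan can be fooled by a file that merely contains that text, and very old workbooks that name the stream "Book" would be rejected.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of any library.
public static class XlsSniffer
{
    private static readonly byte[] CompoundSignature =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // True when the data starts with the compound file signature and
    // contains the UTF-16LE bytes of "Workbook" somewhere after it.
    public static bool LooksLikeXls(byte[] data)
    {
        if (data == null || data.Length < CompoundSignature.Length) return false;

        for (int i = 0; i < CompoundSignature.Length; i++)
        {
            if (data[i] != CompoundSignature[i]) return false;
        }

        // Encoding.Unicode is UTF-16LE; GetBytes does not emit a byte order mark.
        byte[] streamName = Encoding.Unicode.GetBytes("Workbook");
        return IndexOf(data, streamName) >= 0;
    }

    // Naive byte search; fine for a sketch, slow for very large uploads.
    private static int IndexOf(byte[] haystack, byte[] needle)
    {
        for (int i = 0; i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j]) j++;
            if (j == needle.Length) return i;
        }
        return -1;
    }
}
```

A proper compound file parser that enumerates the directory entries would be more reliable than this byte scan, and a failed read by the Excel library remains the final word.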

Mario Z