I need to create a file validator that checks whether an uploaded file is really of the expected type. Originally we only checked the content type of the request, but as always our testers have managed to get around the restriction by simply changing the file extension, for example renaming an .exe file to .csv, which fools our straightforward check.

This is what I have so far in the validator:

private bool IsCorrectFileType(IFormFile file)
{
    using var reader = new StreamReader(file.OpenReadStream());
    using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);

    try
    {
        // Expectation: reading the header fails for anything that is not a CSV.
        csv.Read();
        csv.ReadHeader();
        List<string> headers = csv.Context.HeaderRecord.ToList();
    }
    catch (Exception)
    {
        return false;
    }

    return true;
}

My intention was that if the CSV reader couldn't find the headers in the file, it would blow up and the method would return false. What actually happens is that, for every non-CSV file type, the full content of the file is read in as a single header, so the validator decides it is a valid CSV file and returns true.

I cannot for the life of me work out how to tell whether the file really is a valid CSV, because in most cases the CSV reader will read any stream in as byte data, and the resulting header record looks like that of a valid CSV.

What's also annoying is that, although we will never upload a file with a single header, it feels dumb to just count the headers and reject the file when there is only one.

halfer
Chris Marshall
  • This is kind of the opposite, but does it help? https://stackoverflow.com/questions/2863683/how-to-find-if-a-file-is-an-exe – stuartd May 13 '20 at 23:13
  • In a way it's helpful, but I think the problem is bigger than that, as we would need to protect against every file format we don't support. It would be much better if we could somehow identify that the file is a CSV, but as far as I'm aware a CSV is just a standard text file that's comma-separated, so that might be quite difficult. The reason I say that is they could do the same with a JPEG, then a PNG, then an XLS and so on. – Chris Marshall May 13 '20 at 23:17
  • I am not sure where your CsvReader class comes from; I would guess it comes from some framework you don't control. Instead of using that, why don't you just read the first x bytes of the file and see if you can parse them as a series of comma-separated strings? That should tell you whether you indeed have a CSV file. – Franno May 13 '20 at 23:19
  • Agreed. I have just had a look through the content of different file types and, in byte form, none of them seems to contain commas, so that might be something to check. – Chris Marshall May 13 '20 at 23:23

2 Answers

This is how I would do it.

  1. Check the file for any bytes equal to 0x00. These tend to be common in binary files but are not allowed in text files, except possibly at the very end as a null terminator. So this can be a relatively fast sanity check.

  2. Divide the file into lines (e.g. split on the line delimiters \n and \r), then check each line to ensure it has the same number of commas. Note that some columns may contain commas within them, and you mustn't count those; a column containing embedded commas will be enclosed in quotes to escape them. So you have to write a little code to parse each line to do the counting.

  3. If both of the above steps pass, it is still possible that the file isn't valid, e.g. if it contains invalid UTF sequences. See this post if you want to check for those.

  4. If you know something about what is supposed to be in the file, use regular expressions to validate each and every row and column to see if the file is valid overall.
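
Steps 1 and 2 could be sketched roughly as follows. This is only an illustration under stated assumptions, not a definitive implementation: the names CsvSniffer, ContainsNullByte, CountUnquotedCommas and LooksLikeCsv are made up for this example, the upload is assumed to be already in memory as a byte array and encoded as UTF-8, and quoted fields are assumed not to contain line breaks. It also rejects a trailing null terminator, which is stricter than step 1 requires.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper class, not part of any library.
public static class CsvSniffer
{
    // Step 1: treat any 0x00 byte as a sign of binary content.
    public static bool ContainsNullByte(byte[] data) =>
        Array.IndexOf(data, (byte)0) >= 0;

    // Step 2 helper: count the commas that sit outside double-quoted fields.
    public static int CountUnquotedCommas(string line)
    {
        bool inQuotes = false;
        int commas = 0;
        foreach (char c in line)
        {
            if (c == '"')
                inQuotes = !inQuotes; // an escaped quote ("") toggles twice, leaving the state unchanged
            else if (c == ',' && !inQuotes)
                commas++;
        }
        return commas;
    }

    public static bool LooksLikeCsv(byte[] data)
    {
        if (data.Length == 0 || ContainsNullByte(data))
            return false;

        // Assumption: the file is UTF-8 text.
        string text = Encoding.UTF8.GetString(data);

        // Assumption: no quoted field spans several lines.
        string[] lines = text.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        if (lines.Length == 0)
            return false;

        // Every line must have the same number of unquoted commas as the first one.
        int expected = CountUnquotedCommas(lines[0]);
        return lines.All(line => CountUnquotedCommas(line) == expected);
    }
}
```

Note that a plain text file with no commas at all still passes this sketch, because every line then has zero commas. If single-column files are never expected, the caller could additionally require at least one comma in the first line.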

You could implement just step 1 above, or 1 & 2, or all of them, depending on how critical this is.
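
If step 3 is wanted as well, one possible way to detect invalid UTF-8 sequences in .NET is to decode with a UTF8Encoding that was constructed to throw on invalid bytes. The sketch below assumes the file is supposed to be UTF-8 (a legitimate CSV saved in another encoding, such as Windows-1252, may fail it), and the names Utf8Check and IsValidUtf8 are invented for this example.

```csharp
using System;
using System.Text;

// Hypothetical helper class, not part of any library.
public static class Utf8Check
{
    public static bool IsValidUtf8(byte[] data)
    {
        // throwOnInvalidBytes: true makes the decoder throw instead of
        // silently substituting U+FFFD for malformed sequences.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);

        try
        {
            strictUtf8.GetCharCount(data); // walks the whole buffer without allocating a string
            return true;
        }
        catch (ArgumentException) // DecoderFallbackException derives from ArgumentException
        {
            return false;
        }
    }
}
```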

John Wu

After experimenting with what the content of a non-CSV file looks like in the CSV parser's header context, I found that if it was just gibberish (i.e. exe or jpg content and so on), the long header string would contain non-ASCII characters.

The code below checks whether this is the case. If so, the file is rejected; if not, it is allowed to be ingested.

/// <summary>
/// Minimises chances of incorrect file types being passed to the service that have been
/// maliciously changed to a csv format when the original is for example .exe .jpg and so on.
/// </summary>
/// <remarks>
/// The function below checks whether a header row exists in the incoming file. Whenever the CsvReader is
/// able to read the file, it either creates a list of headers (if the file is valid) or, if the uploaded file
/// has been modified to look like a csv file, reads all of the content into Context.HeaderRecord as a
/// single header. If there is only one header in the file, I run a string function on that header to make
/// sure it contains only ASCII characters. A file that has been maliciously renamed will have all of its
/// bytes loaded into the header record, which means it fails this check and fails validation.
/// This in turn minimises the chances of a malicious file that has had its name changed hitting the file processor.
/// </remarks>
private bool IsCsvFileFormat(IFormFile file)
{
    using var reader = new StreamReader(file.OpenReadStream());
    using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);

    try
    {
        csv.Read();
        csv.ReadHeader();
        var headerRecordList = csv.Context.HeaderRecord.ToList();

        // A lone "header" may be the whole content of a renamed binary file.
        if (headerRecordList.Count == 1)
            return !HasNonASCIIChars(headerRecordList[0]);
    }
    catch (Exception)
    {
        return false;
    }

    return true;
}

private bool HasNonASCIIChars(string str) =>
     (System.Text.Encoding.UTF8.GetByteCount(str) != str.Length);
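
To illustrate why the byte-count comparison works: in UTF-8 every ASCII character takes exactly one byte, while any other character takes two or more, so the byte count equals the string length only for pure-ASCII strings. The small sketch below restates the helper in a hypothetical AsciiCheckDemo class so it can be tried on its own. The sample inputs are made up, and the second one assumes that StreamReader's default decoding has replaced an undecodable byte with U+FFFD.

```csharp
using System.Text;

// Hypothetical stand-alone copy of the helper, for experimentation only.
public static class AsciiCheckDemo
{
    public static bool HasNonAsciiChars(string str) =>
        Encoding.UTF8.GetByteCount(str) != str.Length;
}

// AsciiCheckDemo.HasNonAsciiChars("Id,Name,Email")   -> false (13 chars, 13 bytes)
// AsciiCheckDemo.HasNonAsciiChars("MZ\uFFFD\u0003")  -> true  (4 chars, 6 bytes)
// AsciiCheckDemo.HasNonAsciiChars("Pr\u00E9nom")     -> true  (6 chars, 7 bytes)
```

The last case shows a limitation worth keeping in mind: a genuine single-column CSV whose header contains an accented character would also be rejected by this check.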
Chris Marshall