ASP.NET File Upload Control - FileContent.Read method

Question

I built an application that accepts .txt data files via an ASP.NET File Upload Control. I am using this code to take the uploaded file and read it into a byte array:

byte[] fileInput = new byte[FileUpload1.PostedFile.ContentLength];
FileUpload1.FileContent.Read(fileInput, 0, FileUpload1.PostedFile.ContentLength);

The resultant byte array is stored in a SQL Server database column of type varbinary(max).

Later in the process lifetime, a separate job polls the database for files to be processed. It selects the stored file and converts the returned byte array into a list of strings, splitting on carriage return and/or line feed:

byte[] byteArray = GetMyFile();
List<string> myRecords = null;

myRecords = Encoding.UTF8.GetString(byteArray).Split(new string[] {"\r\n", "\n"}, StringSplitOptions.RemoveEmptyEntries).ToList();

This works fine for almost all of the files that the program receives. However, for a very small percentage of files, I have been seeing extra characters when I convert to the list. There is an extra '\0' (without quotes) that is being added between each character of each line in the file.

A normal line of data should look like this (partial line):

A|B|C|D

The data now looks like this in the myRecords list for these few files:

\0A\0|\0B\0|\0C\0|\0D\0

I have encountered Byte Order Marks before, but these don't appear to be the same thing and I am at a loss as to why they appear, albeit rarely. If I take the same byte array and write the file back out using something like File.WriteAllBytes, it looks fine in a text editor (notepad, notepad++). No extra characters.

I am assuming at this point it is some kind of encoding issue, but I hesitate to make code changes because the vast majority of uploaded files do not display this behavior. The uploaded data comes from a variety of companies and I do not have the capability to ask them all what method they use to create the files initially. Perhaps they are unintentionally injecting something during their file creation process.

Files that pass a round of validation are written to a secure area on the network. I have converted those files to byte arrays and compared their length to the version stored in the database. They are not the same size, so I know something is different, but I don't know how it is happening.

Is there a different way to save the initial uploaded file to prevent this kind of output?

Thanks for any help or suggestions.

If they don't cause any problems, why not just filter them out in your original conversion process? If you don't create the files, the question as to why they are there is moot, yes? It's only about removing them if they are found. http://stackoverflow.com/questions/5132890/c-sharp-replace-bytes-in-byte — Shannon Holsinger, Aug 31 '16 at 14:46
@ShannonHolsinger - I will probably end up adding a filter to ensure correct formatting. I was just hoping there was an explanation for it due to the original uploaded text files all looking correct when viewed in a text editor. Thanks for answering. — user2564788, Aug 31 '16 at 19:47
It's the binary Gremlins. For something that's supposed to be nothing but logical gates that can NOT do anything differently than they are programmed to, I've seen some damned unexplainable stuff. I'm still trying to reproduce this one instance when it seemed as though my code took over and did some totally irrational things. It's a strange world inside that little box, to be sure. — Shannon Holsinger, Aug 31 '16 at 20:57
You're not checking that the input encoding is UTF-8. Is it perhaps UTF-16 in some cases? Use Notepad++ or Programmers File Editor to get some visibility into your file encoding. Or, use a hex editor to compare input and output files. — pseudocoder, Sep 01 '16 at 14:48
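
The UTF-16 suggestion above would fit the symptom: UTF-16 stores each ASCII character as two bytes, one of which is zero, so decoding those bytes as UTF-8 yields a '\0' next to every character, while a text editor that detects the encoding shows the file normally. A minimal sketch of one way to handle it, assuming the affected files start with a byte order mark: let StreamReader detect the encoding from the BOM and fall back to UTF-8 otherwise. The class name UploadDecoding and the method DecodeRecords are hypothetical, not part of the original code, and a UTF-16 file without a BOM would not be caught by this approach.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

public static class UploadDecoding
{
    // Hypothetical helper (not from the original post): decodes the stored bytes,
    // letting StreamReader choose the encoding from a byte order mark when one is
    // present and falling back to UTF-8 when there is none.
    public static List<string> DecodeRecords(byte[] byteArray)
    {
        using (var stream = new MemoryStream(byteArray))
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            return reader.ReadToEnd()
                .Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries)
                .ToList();
        }
    }

    public static void Main()
    {
        // Simulate a UTF-16 LE upload: BOM (FF FE) followed by the encoded text.
        byte[] utf16 = Encoding.Unicode.GetPreamble()
            .Concat(Encoding.Unicode.GetBytes("A|B|C|D\r\nE|F|G|H"))
            .ToArray();

        // Decoding the same bytes as UTF-8 leaves a NUL beside every character.
        string wrong = Encoding.UTF8.GetString(utf16);
        Console.WriteLine(wrong.IndexOf('\0') >= 0); // True

        // BOM-aware decoding recovers the expected records.
        List<string> records = DecodeRecords(utf16);
        Console.WriteLine(records[0]); // A|B|C|D
    }
}
```

This leaves the upload and storage code untouched, since the bytes in the database are already a faithful copy of the file; only the decoding step in the polling job changes. Comparing the first few bytes of an affected file in a hex editor would confirm whether a BOM is actually there.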
