I built an application that accepts .txt data files via an ASP.NET File Upload Control. I am using this code to take the uploaded file and read it into a byte array:
byte[] fileInput = new byte[FileUpload1.PostedFile.ContentLength];
FileUpload1.FileContent.Read(fileInput, 0, FileUpload1.PostedFile.ContentLength);
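One detail about the read above: `Stream.Read` may return fewer bytes than requested, so a single call is not guaranteed to fill the buffer. A minimal sketch of a more defensive read follows. `UploadReader` and `ReadFully` are hypothetical names, not part of my current code, and I have not confirmed that a short read is related to the problem described below.

```csharp
using System;
using System.IO;

public static class UploadReader
{
    // Hypothetical helper: keeps calling Read until 'length' bytes have
    // arrived or the stream ends, whichever comes first.
    public static byte[] ReadFully(Stream input, int length)
    {
        byte[] buffer = new byte[length];
        int total = 0;
        while (total < length)
        {
            int read = input.Read(buffer, total, length - total);
            if (read == 0)
            {
                break; // end of stream reached before 'length' bytes
            }
            total += read;
        }
        if (total < length)
        {
            // Trim the buffer so it holds only the bytes actually read.
            Array.Resize(ref buffer, total);
        }
        return buffer;
    }
}
```

With the upload control it would be called as `UploadReader.ReadFully(FileUpload1.FileContent, FileUpload1.PostedFile.ContentLength)`.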
The resultant byte array is stored in a SQL Server database column of type varbinary(max).
Later in the process lifetime, a separate job polls the database for files to be processed. It selects the stored file and converts the returned byte array into a list of strings, splitting on line breaks ("\r\n" or "\n"):
byte[] byteArray = GetMyFile();
List<string> myRecords = null;
myRecords = Encoding.UTF8.GetString(byteArray).Split(new string[] {"\r\n", "\n"}, StringSplitOptions.RemoveEmptyEntries).ToList();
This works fine for almost all of the files that the program receives. However, for a very small percentage of files, I have been seeing extra characters when I convert to the list. There is an extra '\0' (without quotes) between each character of each line in the file.
A normal line of data should look like this (partial line):
A|B|C|D
The data now looks like this in the myRecords list for these few files:
\0A\0|\0B\0|\0C\0|\0D\0
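Output of this shape can be reproduced by decoding UTF-16 bytes as UTF-8. The sketch below only illustrates that possibility; whether the problem files are actually UTF-16 is an assumption I have not verified, and `NulDemo` is a made-up name.

```csharp
using System;
using System.Text;

public static class NulDemo
{
    // Encodes the text as UTF-16 big-endian (no BOM is emitted by GetBytes),
    // then decodes those bytes as if they were UTF-8.
    public static string DecodeUtf16AsUtf8(string text)
    {
        byte[] utf16 = Encoding.BigEndianUnicode.GetBytes(text);
        return Encoding.UTF8.GetString(utf16);
    }

    public static void Main()
    {
        string decoded = DecodeUtf16AsUtf8("A|B|C|D");
        // Make the NUL characters visible when printing.
        Console.WriteLine(decoded.Replace("\0", "\\0"));
        // prints: \0A\0|\0B\0|\0C\0|\0D
    }
}
```

Each ASCII character occupies two bytes in UTF-16, one of which is zero, and UTF-8 decoding turns every zero byte into a separate '\0' character.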
I have encountered byte order marks (BOMs) before, but these don't appear to be the same thing, and I am at a loss as to why they appear, albeit rarely. If I take the same byte array and write the file back out using something like File.WriteAllBytes, it looks fine in a text editor (Notepad, Notepad++), with no extra characters.
I am assuming at this point it is some kind of encoding issue, but I hesitate to make code changes because the vast majority of uploaded files do not display this behavior. The uploaded data comes from a variety of companies and I do not have the capability to ask them all what method they use to create the files initially. Perhaps they are unintentionally injecting something during their file creation process.
Files that pass a round of validation are written to a secure area on the network. I have converted those files to byte arrays and compared their lengths to the versions stored in the database. They are not the same size, so I know something is different, but I don't know how it is happening.
Is there a different way to save the initial uploaded file to prevent this kind of output?
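To go beyond comparing lengths, the first few bytes of each version could be inspected for a BOM. This is a minimal sketch under that assumption; `BomProbe`, `DescribeBom`, and `HexPrefix` are hypothetical names, and a file without a BOM would simply report "none".

```csharp
using System;

public static class BomProbe
{
    // Hypothetical helper: names the byte order mark at the start of the
    // data, if one is present. Note that a UTF-32 LE BOM (FF FE 00 00)
    // also begins with FF FE and would be reported here as UTF-16 LE.
    public static string DescribeBom(byte[] data)
    {
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return "UTF-8";
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return "UTF-16 LE";
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return "UTF-16 BE";
        return "none";
    }

    // Returns the first 'count' bytes as a hex string such as "FE-FF-00-41".
    public static string HexPrefix(byte[] data, int count)
    {
        int n = Math.Min(count, data.Length);
        return n == 0 ? string.Empty : BitConverter.ToString(data, 0, n);
    }
}
```

Running both helpers on the database copy and on the network copy of the same file would show whether the two differ at the very start.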
Thanks for any help or suggestions.