I have a project in the works to read and convert CSV files based on a set of arbitrary rules: pick a file, tell the program how it should output the data based on the input, and parse the file.

The problem is that when I read the lines from my input files, the program sometimes reads additional lines or splits a line halfway through into two lines. I initially used ReadAllLines, then tested with this code:

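int testCount = 0;
// The using block disposes the reader, which also closes it.
using (StreamReader sr = File.OpenText(_FilePath.Text))
{
    // ReadLine returns null once the end of the file is reached.
    while (sr.ReadLine() != null)
    {
        testCount++;
    }
}

Console.WriteLine("Lines in For: " + testCount);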

and found that a file that has 627 lines is being read as having 681 lines (using both ReadAllLines and counting the lines with the above code).

I looked for other people having the same issue and checked whether these methods impose a maximum length on a 'line', but nothing turned up on Google. The first line in the file that acts up is this one (information in the line has been changed to protect privacy; all special characters are present):

CODE, A/B Company Name, CONTACT NAME, ATTN  NAME A/B, 1234 CORPORATE CORP ST, Smithington, SM, 1234, , 123-456-7890, 123-456-7890, 12345 Plum ROAD, , Nowhere, NW, 12345, A/B Company Name2, Courier, , "Some A Info B For.Shipping Accnt. # 123456789 calendar days early^ 3 days late.", , 

The file itself was exported from an Excel spreadsheet to CSV. All commas in the original file were replaced with ^ (to prevent issues) and will be converted back to commas later.

So, does anyone know of a limit on the length of a line in ReadAllLines, or is there something else going on behind the scenes? Since this was exported from Excel (originally a DBF file) I don't 'think' this is an issue with the file, but I could be wrong. Is there anything I can do to find out?

RyanTimmons91
  • It is totally likely that these files you're observing use an encoding different than UTF8. Check this out: http://stackoverflow.com/questions/10903120/fileinfo-opentext-fails-to-read-special-characters-e-g-%C3%BC%C3%B2%C2%B0 – Alexandru Nov 22 '14 at 05:04
  • PS: You can view the encoding in Notepad++, and/or select to view that file in a different encoding there, and it should show you this change in line numbers. – Alexandru Nov 22 '14 at 05:05
  • @Alexandru: UTF8 uses the same line-break characters as ASCII, Windows-1252, ISO-8859, etc. The only common encoding that would be radically different is UTF16, and if the file were being read incorrectly as UTF16, or as something else when it's supposed to be UTF16, there'd be a lot more wrong with it than just extra line-breaks. – Peter Duniho Nov 22 '14 at 05:10
  • If you want to _really_ see what's in the file, use XVI32, a hex editor: http://www.chmaas.handshake.de/delphi/freeware/xvi32/xvi32.htm – Steve Wellens Nov 22 '14 at 05:13
  • @PeterDuniho Ah, interesting. – Alexandru Nov 22 '14 at 05:14

2 Answers

I guarantee that File.ReadAllLines() and StreamReader.ReadLine() are both behaving exactly as documented, with no hidden traps for you to stumble into.

Do note that neither method distinguishes between different line-break modes. In a single file, they will happily break a line on \r, \n, and \r\n. This means that a file which nominally uses the Windows standard of \r\n, but which has extra \r and/or \n characters in it, will be interpreted as having extra line breaks. Note also that while \r\n is treated as a single line break, \n\r is treated as two line breaks.
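The line-break rules above can be checked with a small sketch. It is not from the original answer: the class name LineBreakDemo and the helper CountLines are made up for illustration, and the text is fed to StreamReader through an in-memory stream so that no file is needed.

```csharp
using System;
using System.IO;
using System.Text;

static class LineBreakDemo
{
    // Hypothetical helper: counts lines with StreamReader.ReadLine,
    // reading the given text from an in-memory UTF-8 stream.
    public static int CountLines(string text)
    {
        int count = 0;
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(text)))
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            while (reader.ReadLine() != null)
            {
                count++;
            }
        }
        return count;
    }

    static void Main()
    {
        Console.WriteLine(CountLines("a\r\nb")); // \r\n counts as one break: 2 lines
        Console.WriteLine(CountLines("a\rb"));   // a lone \r is a break: 2 lines
        Console.WriteLine(CountLines("a\nb"));   // a lone \n is a break: 2 lines
        Console.WriteLine(CountLines("a\n\rb")); // \n\r counts as two breaks: 3 lines
    }
}
```

The last case is the one that tends to surprise: the reversed pair yields an empty line between "a" and "b".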

The way to diagnose exactly what's going on is to look at the file as binary. First, check your output to see where it's breaking the lines, and in particular the first place you find where it breaks a line where you believe it should not have.

Then, open the file in Visual Studio, but instead of just opening it, select the "Open With..." option (click the black triangle on the "Open" button), and choose "Binary Editor". Look through the file to find the text where the first unwanted line break occurred and check the hex values in the file at that location. You will find some combination of \r, \n, or \r\n there (\r is the hex value 0D and \n is 0A).
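If scrolling through a hex view is tedious, the same check can be approximated in code. The following is only a sketch under stated assumptions: the names StrayBreakFinder and FindStrayBreaks are invented here, the path is taken from a hypothetical command-line argument, and the file is assumed to use a single-byte or UTF-8 encoding (in UTF-16 the bytes 0D and 0A would need different handling).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class StrayBreakFinder
{
    // Hypothetical helper: returns the offsets of every 0D or 0A byte
    // that is not part of a well-formed 0D 0A (\r\n) pair.
    public static List<int> FindStrayBreaks(byte[] data)
    {
        var offsets = new List<int>();
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] == 0x0D)
            {
                if (i + 1 < data.Length && data[i + 1] == 0x0A)
                {
                    i++; // skip the \n that completes a \r\n pair
                }
                else
                {
                    offsets.Add(i); // lone \r
                }
            }
            else if (data[i] == 0x0A)
            {
                offsets.Add(i); // lone \n
            }
        }
        return offsets;
    }

    static void Main(string[] args)
    {
        // args[0] is assumed to be the path of the CSV file to inspect.
        byte[] data = File.ReadAllBytes(args[0]);
        foreach (int offset in FindStrayBreaks(data))
        {
            Console.WriteLine("Stray 0x{0:X2} at byte offset {1}", data[offset], offset);
        }
    }
}
```

Each reported offset can then be looked up in the binary editor to see which field contains the unexpected break.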

Peter Duniho
  • The part that was breaking it: the original program apparently had a textbox that supported multiple lines, which added line breaks to the column. Notepad and Excel do not show those line breaks, but C# reads them. Good to know for future reference but, come on, this is kind of a silly issue to have. – RyanTimmons91 Nov 22 '14 at 05:16
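One possible way to cope with line breaks embedded in a quoted column, as described in the comment above, is to keep joining physical lines until the double quotes balance. This is a rough sketch rather than anything from the thread: QuotedRecordReader and ReadRecords are hypothetical names, the embedded break is replaced by a single space (an arbitrary choice), and it assumes Excel-style quoting where a literal quote inside a field is written as "".

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class QuotedRecordReader
{
    // Hypothetical helper: merges physical lines into logical CSV records,
    // treating a line break inside a quoted field as part of that field.
    public static List<string> ReadRecords(TextReader reader)
    {
        var records = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (inQuotes)
            {
                current.Append(' '); // stands in for the embedded line break
            }
            current.Append(line);
            foreach (char c in line)
            {
                if (c == '"')
                {
                    inQuotes = !inQuotes; // an escaped "" toggles twice, so it cancels out
                }
            }
            if (!inQuotes)
            {
                records.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0)
        {
            records.Add(current.ToString()); // unterminated quote at end of input
        }
        return records;
    }

    static void Main()
    {
        var input = new StringReader("a,\"x\r\ny\",b\r\nc,d");
        foreach (string record in ReadRecords(input))
        {
            Console.WriteLine(record);
        }
    }
}
```

For the sample input in Main, the three physical lines collapse into two records, the first being a,"x y",b.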

Please specify the file's encoding when you read it. File.OpenText uses UTF8 encoding by default. Try this:

// encoding is a System.Text.Encoding, e.g. Encoding.Unicode (UTF-16) or Encoding.ASCII
string[] lines = File.ReadAllLines(path, encoding);

http://msdn.microsoft.com/en-us/library/bsy4fhsa(v=vs.110).aspx

Piyush Parashar