I'm working on validating some files from our customers who have to meet a specific file format. Each line has multiple fixed length fields and ends at 511 characters with characters 512 and 513 being CR and LF.

I've been able to use a substring to get each field easily enough, but I'm having an issue with StreamReader/ReadLine locating the 512th and 513th characters. When I try to use a substring to locate those characters, I get a "System.ArgumentOutOfRangeException" error.

string line;
StreamReader file = new StreamReader(textBox2.Text);
while ((line = file.ReadLine()) != null)
{
    int lineLength = line.Length;
    string crlf = "";
    /*
    // Throws ArgumentOutOfRangeException: a 511-character string has no index 511
    if (lineLength == 511)
    {
        crlf = line.Substring(511, 2);
    }
    */
}

I've commented out the part that gives the error. What are my options to confirm the end of each line is CRLF?

John Thomas
Cody
  • `What are my options to confirm the end of each line is CRLF?` Don't use `ReadLine`. Use `ReadBlock` or `ReadToEnd`. – mjwills Sep 25 '18 at 21:27
  • `ReadLine` will return the line without the linebreak. – Pretasoc Sep 25 '18 at 21:30
  • Why do you need to confirm CR/LF? The method name is `ReadLine()`. Nomen est omen. What do you think `ReadLine()` does if not reading a line? The end of a line is marked by CR/LF, LF or "end of stream", or do you disagree? Note, I said they are markers; they are not part of a line of text (a line of text can also end just like that with an EOF/EOS), they just mark the end of the line. What are you really trying to achieve here? –  Sep 25 '18 at 21:32
  • @elgonzo I understand that ReadLine() reads until the end of the line. The file format requirements state the end of each line needs to be CRLF. If someone uses just LF then we run into errors. – Cody Sep 25 '18 at 21:36
  • @Cody: Are multi-byte characters allowed? For example, can the file be encoded using UTF-8, or are only ASCII characters expected? There's a difference between 513 bytes per line and 513 characters per line. – Michael Liu Sep 25 '18 at 21:38
  • @Cody, I see. Unfortunately, you cannot use ReadLine() then, as the CR/LF handling is hard-coded into the method (see here: https://referencesource.microsoft.com/mscorlib/R/a4ada5f765646068.html). mjwills's suggestions would be the way to go then. (You can of course take inspiration from ReadLine's implementation...) –  Sep 25 '18 at 21:40
  • @MichaelLiu Characters 1-511 are a mix of alpha, numeric, and alphanumeric. 512 must be ASCII 0D and 513 must be ASCII 0A – Cody Sep 25 '18 at 21:40
  • @mjwills I'm reading into ReadBlock and ReadToEnd now – Cody Sep 25 '18 at 21:42
  • If it's a smallish file, you can call File.ReadAllText and validate that the entire string matches the regular expression `^([0-9A-Za-z]{511}\r\n)*$`. – Michael Liu Sep 25 '18 at 21:48
  • Note that the code you commented out is guaranteed to fail. It says "if the line is 511 chars long, get the 512th and 513th characters and put them in a variable". This can never work. – pm100 Sep 25 '18 at 23:34
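
The whole-file regular expression suggested in the comments could be sketched roughly as follows. This is only an illustration for smallish, ASCII-only files: the class and method names (`WholeFileCheck`, `IsValid`, `IsValidFile`) are made up for this example, and the character class `[0-9A-Za-z]` is taken from the comment, so it would need widening if the fixed-width fields may contain spaces or punctuation. The sketch uses `\A` and `\z` instead of `^` and `$`, because in .NET `$` also matches just before a final newline, which would let a stray trailing LF slip through.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original post.
public static class WholeFileCheck
{
    // Zero or more records, each exactly 511 ASCII letters/digits followed by CR LF.
    // \A and \z anchor at the absolute start and end of the input.
    private static readonly Regex Layout =
        new Regex(@"\A(?:[0-9A-Za-z]{511}\r\n)*\z");

    // Validates text that is already in memory.
    public static bool IsValid(string content)
    {
        return Layout.IsMatch(content);
    }

    // Reads the whole file as ASCII (assumption: no multi-byte characters)
    // and validates it in one go.
    public static bool IsValidFile(string path)
    {
        return IsValid(File.ReadAllText(path, Encoding.ASCII));
    }
}
```

A call such as `WholeFileCheck.IsValidFile(textBox2.Text)` only answers yes or no; it does not say which line is wrong, so it suits a quick accept/reject check rather than detailed error reporting.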

1 Answer

As others have said in the comments, you can't use ReadLine() and expect to find the EOL characters in the returned string. Now, if I understand you correctly, with StreamReader, you can do something like this:

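using (var sr = new StreamReader(filePath))
{
    // Offset of the next record. It is counted in characters but passed to Seek
    // as a byte offset, so this only works for single-byte encodings such as ASCII.
    long position = 0;
    char[] buffer;
    while (sr.Peek() >= 0)
    {
        // A fresh array per record: if fewer than 513 characters are read,
        // the tail stays '\0' and the checks below fall through to the else branch.
        buffer = new char[513];
        sr.Read(buffer, 0, buffer.Length);
        if (buffer[511] == '\r' && buffer[512] == '\n')
        {
            // 511 data characters followed by CRLF
            position += 513;
            Console.WriteLine("CRLF");
        }
        else if (buffer[511] == '\r' || buffer[511] == '\n')
        {
            // Single-character line ending: the record was only 512 characters,
            // so step back one character and resynchronise the reader.
            position += 512;
            sr.BaseStream.Seek(position, SeekOrigin.Begin);
            sr.DiscardBufferedData();
            Console.WriteLine("CR or LF");
        }
        else
        {
            // No line ending where one is expected
            Console.WriteLine("Something went wrong!");
            break;
        }
    }
}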

This will read chunks of 513 characters at a time and check for the following:

  • If the last two characters are CRLF, then we're good to go and it will continue reading the next 513 characters.

  • If the above test doesn't pass but the character in position 511 is either a CR or an LF, it will set the position one character back (considering that the previous "line" was 512 characters, not 513) and continue reading the next 513 characters.

  • If none of the above tests passes (i.e., the character in position 511 is neither CR nor LF), it will exit the loop. You can adjust this as per your requirements.

And if you wanted to access the current line as a string, you can easily do that using something like:

string line = new string(buffer, 0, 511);
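
A somewhat more defensive variant of the chunked read above, shown only as a sketch: it uses `ReadBlock`, which keeps reading until the buffer is full or the input ends, and checks the returned count so that a truncated final record is reported rather than silently compared against leftover buffer contents. The names `FixedRecordCheck`, `FirstBadRecord`, and `FirstBadRecordInFile` are hypothetical, the file is assumed to be ASCII, and unlike the loop above this version stops at the first bad record instead of trying to resynchronise.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class FixedRecordCheck
{
    public const int DataLength = 511;
    public const int RecordLength = DataLength + 2; // data + CR + LF

    // Returns the 1-based number of the first invalid record, or 0 if every record is valid.
    public static int FirstBadRecord(TextReader reader)
    {
        var buffer = new char[RecordLength];
        int record = 0;
        while (true)
        {
            // ReadBlock returns fewer characters than requested only at end of input.
            int read = reader.ReadBlock(buffer, 0, RecordLength);
            if (read == 0)
                return 0;                 // clean end of input
            record++;
            if (read < RecordLength)
                return record;            // truncated record
            if (buffer[DataLength] != '\r' || buffer[DataLength + 1] != '\n')
                return record;            // positions 512/513 are not CR LF
        }
    }

    // Convenience wrapper; assumes an ASCII-encoded file.
    public static int FirstBadRecordInFile(string path)
    {
        using (var reader = new StreamReader(path, Encoding.ASCII))
        {
            return FirstBadRecord(reader);
        }
    }
}
```

Taking a `TextReader` keeps the check testable with a `StringReader`; for real files, `FixedRecordCheck.FirstBadRecordInFile(filePath)` returns 0 when all records end in CRLF at the expected position.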
  • FYI: `sr.Read(buffer, 0, buffer.Length);` could be made more robust by checking its return value (=number of chars read from the stream) to protect against premature end of data due to incomplete/truncated file. –  Sep 25 '18 at 22:26
  • Alternatively to testing the return value of sr.Read for detecting incomplete/truncated files, the last two characters in the array could be set to zero (or something else that is neither CR nor LF) before calling sr.Read to reliably detect whether a line of sufficient length with CR/LF has been read... –  Sep 25 '18 at 22:35
  • @elgonzo You are right. I moved the array initialization inside the loop. That way, if the length is insufficient, the last two characters will be nulls which will cause it to jump into the else branch. – 41686d6564 stands w. Palestine Sep 25 '18 at 22:45
  • @AhmedAbdelhameed More information to account for: each line can be either 509 or 513 characters long, and both require CRLF as the last two characters. I'm guessing I should check the line length and run this check if line.Length returns 511, do the same if it returns 507, and present an error if it returns anything other than 511 or 507? – Cody Sep 26 '18 at 03:39
  • Instead of tracking the full position, why not seek using a negative offset relative to `SeekOrigin.Current`? In the `else if`, you'd just have to seek by `-1` – pinkfloydx33 Sep 26 '18 at 07:59
  • What happens if you've got a malformed line that is only (for example) 100 chars? You'd pull in all of that line and part of the next and end up rejecting both, and then all lines after. If that's in any way possible for the OP's files, they'd have to search the buffer for any newlines in the middle and then reset the position after them. – pinkfloydx33 Sep 26 '18 at 08:01
  • @pinkfloydx33 That doesn't work because StreamReader uses buffers, so `Current` [isn't exactly what you expect it to be](https://stackoverflow.com/a/5404324/4934172). Regarding your second question: If that happens, it'll break out of the loop (check the `else` branch). The OP didn't specify what he wants to do in this case and that's why I told him he might handle this differently if he wants to. – 41686d6564 stands w. Palestine Sep 26 '18 at 09:13
  • @AhmedAbdelhameed never knew that about Current. I don't usually find myself working directly with streams (so much that I need to seek, anyways)... Good to know – pinkfloydx33 Sep 26 '18 at 09:14
  • @Cody You specifically mentioned that each line has a fixed length of 511 characters (excluding the EOL characters), now you're changing the requirements. Anyway, in this case, you can check for characters in position 507 and 508 and seek back accordingly if EOL is found. If not, you can continue checking for positions 511 and 512. – 41686d6564 stands w. Palestine Sep 26 '18 at 09:16
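
For the case raised in the comments where a line may hold either 507 or 511 data characters, one possible sketch scans the text character by character instead of reading fixed-size chunks. It assumes the file fits in memory and has been read with something like `File.ReadAllText`; the names `LineEndingScan` and `Validate` are invented for this example. Because it looks for the line endings themselves, a short or LF-only line is reported on its own and the lines after it are still checked.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original thread.
public static class LineEndingScan
{
    // Allowed data lengths (excluding CR LF), as stated in the comment thread.
    private static readonly HashSet<int> AllowedLengths = new HashSet<int> { 507, 511 };

    // Returns one message per problem found; an empty list means the content is valid.
    public static List<string> Validate(string content)
    {
        var problems = new List<string>();
        int lineNumber = 1;
        int lineStart = 0;
        for (int i = 0; i < content.Length; i++)
        {
            char c = content[i];
            if (c != '\r' && c != '\n')
                continue;

            // A proper terminator is CR immediately followed by LF.
            bool isCrLf = c == '\r' && i + 1 < content.Length && content[i + 1] == '\n';
            int length = i - lineStart;
            if (!isCrLf)
                problems.Add($"Line {lineNumber}: terminator is not CRLF");
            else if (!AllowedLengths.Contains(length))
                problems.Add($"Line {lineNumber}: unexpected length {length}");

            if (isCrLf)
                i++;                      // skip the LF of the pair
            lineStart = i + 1;
            lineNumber++;
        }
        // Anything left after the last terminator is an unterminated line.
        if (lineStart < content.Length)
            problems.Add($"Line {lineNumber}: missing CRLF at end of file");
        return problems;
    }
}
```

Returning a list of messages rather than a single flag is a design choice for this sketch: it lets the caller show every offending line to the customer in one pass.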