I have to process a text file and check if it ends with a carriage return or not.

I have to read the whole content, make some changes and rewrite it into the target file, keeping exactly the same formatting as the original. And here is the problem: I don't know whether the original file ends with a line break or not.

I've already tried:

  • the StreamReader.ReadLine() method, but the string that is returned does not contain the terminating carriage return and/or line feed.
  • also the ReadToEnd() method can be a solution, but I'm wondering about the performance in case of very big files. The solution has to be efficient.
  • getting the last 2 characters and checking whether they are equal to "\r\n" might solve it, but I have to deal with lots of encodings, and it seems practically impossible to get them.

How can I efficiently read all the text of a file and determine whether it ended in a newline?

CodeCaster
  • If you have to read the whole content anyway, why are you worried about using ReadToEnd()? – PhillipH Jan 14 '17 at 10:39
  • @PhillipH because ReadToEnd() returns the entire contents of the file. – CodeCaster Jan 14 '17 at 10:41
  • What does "process a .txt file" mean? Reading it line by line, do some processing and possibly reading an extra empty line at the end? – Alexei - check Codidact Jan 14 '17 at 10:42
  • @CodeCaster, that won't work. Depending on the encoding, these characters could mean something completely different. – Sefe Jan 14 '17 at 10:54
  • @CodeCaster As an example, a UCS-2 file or a UTF-16 file will end with something like `00 0D 00 0A`, depending on endianness. Each character, including ASCII, will have a length of 16 (or 32) bits. – Jeppe Stig Nielsen Jan 14 '17 at 11:13
  • @CodeCaster But if you do not know the encoding, you will not know if the hex sequence `000D 000A` is four characters (`"\0\r\0\n"`), two characters (`"\r\n"`), or something else. If you want to read one character at a time, you must care about encoding (whether a character is always 8 bits, always 16 bits, can vary 8/16/24 etc. bits, or something else). – Jeppe Stig Nielsen Jan 14 '17 at 11:19
  • @Jeppe yes I now realise I made a mess of the comments, and am gonna remove them. You do indeed need to care for the encoding. It's too early. – CodeCaster Jan 14 '17 at 11:24

2 Answers

After reading the file through ReadLine(), you can seek back to two bytes before the end of the file and compare what you read there to CR-LF. This assumes an encoding in which CR and LF each take a single byte, such as the UTF-8 used here:

using System.IO;
using System.Text;

string s;
using (StreamReader sr = new StreamReader(@"C:\Users\User1\Desktop\a.txt", encoding: Encoding.UTF8))
{
    while (!sr.EndOfStream)
    {
        s = sr.ReadLine();
        //process the line we read...
    }

    if (sr.BaseStream.Length >= 2) //ensure the file is not too small (a 1-byte file needs separate handling)
    {
        //back 2 bytes from end of file
        sr.BaseStream.Seek(-2, SeekOrigin.End);
        sr.DiscardBufferedData(); //resync the reader's buffer with the new stream position

        int s1 = sr.Read(); //read the char before last
        int s2 = sr.Read(); //read the last char
        if (s2 == 10) //file ends with CR-LF or LF (CR=13, LF=10)
        {
            if (s1 == 13) { } //file ends with CR-LF (Windows EOL format)
            else { } //file ends with just LF (UNIX/macOS format)
        }
    }
}
S.Serpooshan
  • One caveat though, `ReadLine()` eats newline characters. When the OP will be rewriting the file using the variable `s` and a newline appended, you're probably replacing the newline characters. – CodeCaster Jan 14 '17 at 11:15
  • Thanks guys. This will definitely work, but just for UTF-8 encoding. If you have to deal with several types of encoding, as in my case, there will be much more work to do. – Cristian Stirbe Jan 14 '17 at 22:27
  • @Cristy _"This will definitely work, but just for UTF-8 encoding"_ - what is shown in this answer is just one of the many constructors of the StreamReader. The code shown is encoding-agnostic. – CodeCaster Jan 15 '17 at 17:22
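
To cover the encoding concern raised in the comments above, one option is to encode the newline itself with the file's encoding and compare the result with the file's last bytes. The following is only a minimal sketch and not part of the original answer: `TrailingNewlineCheck` and `EndsWith` are hypothetical names, and it assumes a seekable stream, a known encoding, and an encoding in which "\r" and "\n" always map to the same fixed byte sequence (UTF-8, UTF-16, UTF-32 and single-byte code pages do).

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class TrailingNewlineCheck
{
    // Hypothetical helper: returns true if the last bytes of the stream equal
    // `text` encoded with `encoding`. The stream position is restored afterwards.
    public static bool EndsWith(Stream stream, Encoding encoding, string text)
    {
        byte[] expected = encoding.GetBytes(text); // GetBytes does not add a BOM
        if (stream.Length < expected.Length)
        {
            return false;
        }

        long savedPosition = stream.Position;
        try
        {
            stream.Seek(-expected.Length, SeekOrigin.End);

            // Stream.Read may return fewer bytes than requested, so loop.
            byte[] tail = new byte[expected.Length];
            int read = 0;
            while (read < tail.Length)
            {
                int n = stream.Read(tail, read, tail.Length - read);
                if (n == 0)
                {
                    return false;
                }
                read += n;
            }

            return tail.SequenceEqual(expected);
        }
        finally
        {
            stream.Position = savedPosition;
        }
    }
}
```

A call such as `TrailingNewlineCheck.EndsWith(stream, Encoding.Unicode, "\r\n")` tests for a Windows line ending in a UTF-16 file; checking "\r\n" first and "\n" second distinguishes the two formats. If the encoding is detected from a byte order mark by a StreamReader, its CurrentEncoding property should reflect the detected encoding once the first read has happened.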
So you're processing a text file, meaning you need to read all text, and want to preserve any newline characters, even at the end of the file.

You've correctly concluded that ReadLine() eats those: the returned string never contains the line terminator, so it can't tell you whether one was present. In particular, when a file ends with a newline, ReadLine() does not return an extra empty line at the end (StreamReader.EndOfStream is already true after the last line with content has been read). File.ReadAllLines() drops the terminators in the same way. Given you're potentially dealing with large files, you also don't want to read the entire file into memory at once.

You also can't just compare the last two bytes of the file, because there are encodings that use more than one byte to encode a character, such as UTF-16. So you'll need to read the file in an encoding-aware way. A StreamReader does just that.
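
To make the byte-level difference concrete, here is a small sketch (the class and method names are made up for illustration) that prints how a given string is encoded under a given encoding:

```csharp
using System;
using System.Text;

public static class NewlineBytesDemo
{
    // Returns the bytes of `text` under `encoding` as a hex string such as "0D-0A".
    // Encoding.GetBytes does not include a byte order mark.
    public static string ToHex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }
}
```

`NewlineBytesDemo.ToHex("\r\n", Encoding.UTF8)` gives `0D-0A`, while `Encoding.Unicode` (UTF-16 little-endian) gives `0D-00-0A-00` and `Encoding.BigEndianUnicode` gives `00-0D-00-0A`. In both UTF-16 variants the last two bytes of the file are therefore just the final LF code unit, never the CR-LF pair.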

So a solution would be to create your own version of ReadLine(), which includes the newline character(s) at the end:

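using System.IO;
using System.Text;

public static class StreamReaderExtensions
{
    // Like ReadLine(), but keeps the line terminator ("\n" or "\r\n") in the result.
    // Unlike ReadLine(), a lone "\r" is not treated as a line break.
    public static string ReadLineWithNewLine(this StreamReader reader)
    {
        var builder = new StringBuilder();

        while (!reader.EndOfStream)
        {
            int c = reader.Read();

            builder.Append((char) c);
            if (c == 10) // LF ends the line; a preceding CR is already in the builder
            {
                break;
            }
        }

        return builder.ToString();
    }
}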

Then you can check whether the last returned line ends in \n:

string line = "";

using (var stream = new StreamReader(@"D:\Temp\NewlineAtEnd.txt"))
{
    while (!stream.EndOfStream)
    {
        line = stream.ReadLineWithNewLine();
        Console.Write(line);
    }
}

Console.WriteLine();

if (line.EndsWith("\n"))
{
    Console.WriteLine("Newline at end of file");
}
else
{
    Console.WriteLine("No newline at end of file");
}

Though the StreamReader is heavily optimized, I can't vouch for the performance of reading one character at a time. A quick test using two equal 100 MB text files showed a quite drastic slowdown compared to ReadLine() (~1800 vs ~400 ms).

This approach does preserve the original line endings though, meaning you can safely rewrite a file using strings returned by this extension method, without changing all \n to \r\n or vice versa.
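
As a rough illustration of such a rewrite, the sketch below copies text from a reader to a writer, changing each line's content and writing its original terminator back untouched. It is not the code I tested above: `PreservingRewrite`, `Rewrite` and `ReadLineKeepingTerminator` are hypothetical names, the reading loop is the same idea as ReadLineWithNewLine() but works on any TextReader, and choosing the output encoding (including whether a BOM is written) is left to the caller.

```csharp
using System;
using System.IO;
using System.Text;

public static class PreservingRewrite
{
    // Returns the next line including its terminator, or "" at end of input.
    private static string ReadLineKeepingTerminator(TextReader reader)
    {
        var builder = new StringBuilder();
        int c;
        while ((c = reader.Read()) != -1)
        {
            builder.Append((char)c);
            if (c == '\n')
            {
                break;
            }
        }
        return builder.ToString();
    }

    // Applies `transform` to the content of every line and writes the result
    // followed by the line's original terminator, if it had one.
    public static void Rewrite(TextReader reader, TextWriter writer, Func<string, string> transform)
    {
        string line;
        while ((line = ReadLineKeepingTerminator(reader)).Length > 0)
        {
            // Find where the content ends: strip a trailing "\n", then a trailing "\r".
            int contentLength = line.Length;
            if (contentLength > 0 && line[contentLength - 1] == '\n') contentLength--;
            if (contentLength > 0 && line[contentLength - 1] == '\r') contentLength--;

            writer.Write(transform(line.Substring(0, contentLength)));
            writer.Write(line.Substring(contentLength)); // original terminator, possibly empty
        }
    }
}
```

With a StringReader over "a\r\nb\nc" and `s => s.ToUpper()` as the transform, the writer receives "A\r\nB\nC": mixed line endings survive, and no newline is added after the last line.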

CodeCaster
  • Appears to work even if the file is in UTF-8 (say) and contains characters from outside [Unicode plane](https://en.wikipedia.org/wiki/Plane_(Unicode)) 0. In that case it appears that the `while` loop will iterate twice for that single character, creating _two_ UTF-16 code units (called `char` values in .NET) for the character, as is necessary (see [surrogate pair](https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF)). Not well documented when the return type of `Read()` is `int`. One could have thought it would use UTF-32, returning the entire character at once. – Jeppe Stig Nielsen Jan 14 '17 at 15:44
  • @Jeppe thanks. And I thought I knew a thing or two about text encoding and how .NET handles Unicode. So I need to add another check, to verify whether `Read()` returned the first half for a surrogate pair, and if so, don't treat the second half as `\n` even though its code point is 10? Is 10 a valid second half of a surrogate pair? I can't do more testing at the moment. – CodeCaster Jan 14 '17 at 17:34
  • No, a 16-bit code unit cannot be valid as both a surrogate component and an ordinary "single" codepoint. Depending on the precise value of the 16-bit code unit (`char` in .NET), it will be either (1) a "single" codepoint, or (2) the lower part of a surrogate pair, or (3) the upper part of a surrogate pair, but it can never be more than one of those three. So if we assume that no invalid files are encountered, you do not really have to worry about this in your use. – Jeppe Stig Nielsen Jan 14 '17 at 18:55
  • @CodeCaster Thanks a lot ! This solution seems pretty reasonable. It's a good idea to create your own ReadLine() method. I will try this soon and come back with news. – Cristian Stirbe Jan 14 '17 at 22:31
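
The surrogate behaviour discussed in the comments above can be observed with a small experiment. This is only an illustrative sketch with made-up names (`SurrogateDemo`, `ReadCodeUnits`); it encodes a string as UTF-8 in memory and collects every value that StreamReader.Read() returns for it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class SurrogateDemo
{
    // Encodes `text` as UTF-8 and returns each value StreamReader.Read() yields,
    // i.e. one entry per UTF-16 code unit.
    public static List<int> ReadCodeUnits(string text)
    {
        var units = new List<int>();
        var bytes = Encoding.UTF8.GetBytes(text);

        using (var reader = new StreamReader(new MemoryStream(bytes), Encoding.UTF8))
        {
            int c;
            while ((c = reader.Read()) != -1)
            {
                units.Add(c);
            }
        }

        return units;
    }
}
```

For the input "\U0001F600\n" (one character outside plane 0 followed by LF) this yields three values: 0xD83D, 0xDE00 and 10. The two surrogate halves lie in the range 0xD800 to 0xDFFF, so neither can be mistaken for the LF value 10, and `char.IsSurrogate((char)10)` is false.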