101

I need to process a large file, around 400K lines and 200 MB, and sometimes I have to process it from the bottom up. How can I use an iterator (yield return) here? Basically, I don't want to load everything into memory. I know it is more efficient to use an iterator in .NET.

Matthew Murdoch
Liang Wu
  • See also: [Get last 10 lines of very large text file > 10GB c#](http://stackoverflow.com/questions/398378) – hippietrail Nov 05 '12 at 18:30
  • One possibility would be to read a sufficiently large amount from the end and then use String.LastIndexOf to go backwards searching for "\r\n". – Sam Hobbs Jun 12 '15 at 19:36
  • See my comment in the duplicate http://stackoverflow.com/questions/398378/get-last-10-lines-of-very-large-text-file-10gb-c-sharp/33907602#33907602 – Xan-Kun Clark-Davis Nov 25 '15 at 02:46
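
The chunk-from-the-end idea in the comments above could look roughly like the sketch below. It is only an illustration: `TailReader` and `ReadLastLine` are hypothetical names, and the code assumes a single-byte encoding (ASCII here) and a last line that fits inside the chosen tail size.

```csharp
using System;
using System.IO;
using System.Text;

public static class TailReader
{
    // Hypothetical helper: returns the last line found in the final 'tailBytes' bytes of the stream.
    // Assumes a single-byte encoding (ASCII here) and a last line shorter than 'tailBytes'.
    public static string ReadLastLine(Stream stream, int tailBytes)
    {
        long start = Math.Max(0L, stream.Length - tailBytes);
        stream.Position = start;

        byte[] buffer = new byte[(int)(stream.Length - start)];
        int read = 0;
        while (read < buffer.Length)
        {
            int n = stream.Read(buffer, read, buffer.Length - read);
            if (n == 0) break; // stream ended early
            read += n;
        }

        // Ignore the line break that terminates the file, then cut after the last remaining one.
        string text = Encoding.ASCII.GetString(buffer, 0, read).TrimEnd('\r', '\n');
        int lastBreak = text.LastIndexOf('\n');
        return lastBreak >= 0 ? text.Substring(lastBreak + 1) : text;
    }

    public static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("one\r\ntwo\r\nthree\r\n");
        using (var stream = new MemoryStream(data))
        {
            Console.WriteLine(ReadLastLine(stream, 8)); // prints "three"
        }
    }
}
```

Only the tail of the file is read, so memory use is bounded by `tailBytes`, but with a variable-width encoding the tail could begin in the middle of a character.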

11 Answers

153

Reading text files backwards is really tricky unless you're using a fixed-size encoding (e.g. ASCII). When you've got variable-size encoding (such as UTF-8) you will keep having to check whether you're in the middle of a character or not when you fetch data.

There's nothing built into the framework, and I suspect you'd have to do separate hard coding for each variable-width encoding.
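
To make the "middle of a character" check concrete for UTF-8, here is a minimal sketch. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx; the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class Utf8Boundaries
{
    // A UTF-8 continuation byte looks like 10xxxxxx; any other byte starts a character.
    public static bool IsCharacterStart(byte b)
    {
        return (b & 0xC0) != 0x80;
    }

    // Walks backwards from 'index' to the nearest byte that starts a character.
    public static int FindCharacterStart(byte[] data, int index)
    {
        while (index > 0 && !IsCharacterStart(data[index]))
        {
            index--;
        }
        return index;
    }

    public static void Main()
    {
        // 'a', U+00E9 and U+20AC encode as 1 + 2 + 3 bytes: 61 | C3 A9 | E2 82 AC
        byte[] bytes = Encoding.UTF8.GetBytes("a\u00E9\u20AC");
        Console.WriteLine(FindCharacterStart(bytes, 5)); // prints 3
        Console.WriteLine(FindCharacterStart(bytes, 2)); // prints 1
    }
}
```

A seek that lands on index 5 or 2 is inside a multi-byte character, so a reader has to back up to index 3 or 1 before decoding.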

EDIT: This has been somewhat tested - but that's not to say it doesn't still have some subtle bugs around. It uses StreamUtil from MiscUtil, but I've included just the necessary (new) method from there at the bottom. Oh, and it needs refactoring - there's one pretty hefty method, as you'll see:

using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace MiscUtil.IO
{
    /// <summary>
    /// Takes an encoding (defaulting to UTF-8) and a function which produces a seekable stream
    /// (or a filename for convenience) and yields lines from the end of the stream backwards.
    /// Only single byte encodings, and UTF-8 and Unicode, are supported. The stream
    /// returned by the function must be seekable.
    /// </summary>
    public sealed class ReverseLineReader : IEnumerable<string>
    {
        /// <summary>
        /// Buffer size to use by default. Classes with internal access can specify
        /// a different buffer size - this is useful for testing.
        /// </summary>
        private const int DefaultBufferSize = 4096;

        /// <summary>
        /// Means of creating a Stream to read from.
        /// </summary>
        private readonly Func<Stream> streamSource;

        /// <summary>
        /// Encoding to use when converting bytes to text
        /// </summary>
        private readonly Encoding encoding;

        /// <summary>
        /// Size of buffer (in bytes) to read each time we read from the
        /// stream. This must be at least as big as the maximum number of
        /// bytes for a single character.
        /// </summary>
        private readonly int bufferSize;

        /// <summary>
        /// Function which, when given a position within a file and a byte, states whether
        /// or not the byte represents the start of a character.
        /// </summary>
        private Func<long,byte,bool> characterStartDetector;

        /// <summary>
        /// Creates a ReverseLineReader from a stream source. The delegate is only
        /// called when the enumerator is fetched. UTF-8 is used to decode
        /// the stream into text.
        /// </summary>
        /// <param name="streamSource">Data source</param>
        public ReverseLineReader(Func<Stream> streamSource)
            : this(streamSource, Encoding.UTF8)
        {
        }

        /// <summary>
        /// Creates a ReverseLineReader from a filename. The file is only opened
        /// (or even checked for existence) when the enumerator is fetched.
        /// UTF-8 is used to decode the file into text.
        /// </summary>
        /// <param name="filename">File to read from</param>
        public ReverseLineReader(string filename)
            : this(filename, Encoding.UTF8)
        {
        }

        /// <summary>
        /// Creates a ReverseLineReader from a filename. The file is only opened
        /// (or even checked for existence) when the enumerator is fetched.
        /// </summary>
        /// <param name="filename">File to read from</param>
        /// <param name="encoding">Encoding to use to decode the file into text</param>
        public ReverseLineReader(string filename, Encoding encoding)
            : this(() => File.OpenRead(filename), encoding)
        {
        }

        /// <summary>
        /// Creates a ReverseLineReader from a stream source. The delegate is only
        /// called when the enumerator is fetched.
        /// </summary>
        /// <param name="streamSource">Data source</param>
        /// <param name="encoding">Encoding to use to decode the stream into text</param>
        public ReverseLineReader(Func<Stream> streamSource, Encoding encoding)
            : this(streamSource, encoding, DefaultBufferSize)
        {
        }

        internal ReverseLineReader(Func<Stream> streamSource, Encoding encoding, int bufferSize)
        {
            this.streamSource = streamSource;
            this.encoding = encoding;
            this.bufferSize = bufferSize;
            if (encoding.IsSingleByte)
            {
                // For a single byte encoding, every byte is the start (and end) of a character
                characterStartDetector = (pos, data) => true;
            }
            else if (encoding is UnicodeEncoding)
            {
                // For UTF-16, even-numbered positions are the start of a character.
                // TODO: This assumes no surrogate pairs. More work required
                // to handle that.
                characterStartDetector = (pos, data) => (pos & 1) == 0;
            }
            else if (encoding is UTF8Encoding)
            {
                // For UTF-8, bytes with the top bit clear or the second bit set are the start of a character
                // See http://www.cl.cam.ac.uk/~mgk25/unicode.html
                characterStartDetector = (pos, data) => (data & 0x80) == 0 || (data & 0x40) != 0;
            }
            else
            {
                throw new ArgumentException("Only single byte, UTF-8 and Unicode encodings are permitted");
            }
        }

        /// <summary>
        /// Returns the enumerator reading strings backwards. If this method discovers that
        /// the returned stream is either unreadable or unseekable, a NotSupportedException is thrown.
        /// </summary>
        public IEnumerator<string> GetEnumerator()
        {
            Stream stream = streamSource();
            if (!stream.CanSeek)
            {
                stream.Dispose();
                throw new NotSupportedException("Unable to seek within stream");
            }
            if (!stream.CanRead)
            {
                stream.Dispose();
                throw new NotSupportedException("Unable to read within stream");
            }
            return GetEnumeratorImpl(stream);
        }

        private IEnumerator<string> GetEnumeratorImpl(Stream stream)
        {
            try
            {
                long position = stream.Length;

                if (encoding is UnicodeEncoding && (position & 1) != 0)
                {
                    throw new InvalidDataException("UTF-16 encoding provided, but stream has odd length.");
                }

                // Allow up to two bytes for data from the start of the previous
                // read which didn't quite make it as full characters
                byte[] buffer = new byte[bufferSize + 2];
                char[] charBuffer = new char[encoding.GetMaxCharCount(buffer.Length)];
                int leftOverData = 0;
                String previousEnd = null;
                // TextReader doesn't return an empty string if there's line break at the end
                // of the data. Therefore we don't return an empty string if it's our *first*
                // return.
                bool firstYield = true;

                // A line-feed at the start of the previous buffer means we need to swallow
                // the carriage-return at the end of this buffer - hence this needs declaring
                // way up here!
                bool swallowCarriageReturn = false;

                while (position > 0)
                {
                    int bytesToRead = Math.Min(position > int.MaxValue ? bufferSize : (int)position, bufferSize);

                    position -= bytesToRead;
                    stream.Position = position;
                    StreamUtil.ReadExactly(stream, buffer, bytesToRead);
                    // If we haven't read a full buffer, but we had bytes left
                    // over from before, copy them to the end of the buffer
                    if (leftOverData > 0 && bytesToRead != bufferSize)
                    {
                        // Buffer.BlockCopy doesn't document its behaviour with respect
                        // to overlapping data: we *might* just have read 7 bytes instead of
                        // 8, and have two bytes to copy...
                        Array.Copy(buffer, bufferSize, buffer, bytesToRead, leftOverData);
                    }
                    // We've now *effectively* read this much data.
                    bytesToRead += leftOverData;

                    int firstCharPosition = 0;
                    while (!characterStartDetector(position + firstCharPosition, buffer[firstCharPosition]))
                    {
                        firstCharPosition++;
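                        // Bad UTF-8 sequences could trigger this. The check assumes at most two
                        // continuation bytes per character (characters of up to 3 bytes), so a
                        // 4-byte sequence whose three continuation bytes fall at the start of a
                        // buffer is also reported as invalid here. If this is the start of the
                        // file so we've done a short read, we should have the character start
                        // somewhere in the usable buffer.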
                        if (firstCharPosition == 3 || firstCharPosition == bytesToRead)
                        {
                            throw new InvalidDataException("Invalid UTF-8 data");
                        }
                    }
                    leftOverData = firstCharPosition;

                    int charsRead = encoding.GetChars(buffer, firstCharPosition, bytesToRead - firstCharPosition, charBuffer, 0);
                    int endExclusive = charsRead;

                    for (int i = charsRead - 1; i >= 0; i--)
                    {
                        char lookingAt = charBuffer[i];
                        if (swallowCarriageReturn)
                        {
                            swallowCarriageReturn = false;
                            if (lookingAt == '\r')
                            {
                                endExclusive--;
                                continue;
                            }
                        }
                        // Anything non-line-breaking, just keep looking backwards
                        if (lookingAt != '\n' && lookingAt != '\r')
                        {
                            continue;
                        }
                        // End of CRLF? Swallow the preceding CR
                        if (lookingAt == '\n')
                        {
                            swallowCarriageReturn = true;
                        }
                        int start = i + 1;
                        string bufferContents = new string(charBuffer, start, endExclusive - start);
                        endExclusive = i;
                        string stringToYield = previousEnd == null ? bufferContents : bufferContents + previousEnd;
                        if (!firstYield || stringToYield.Length != 0)
                        {
                            yield return stringToYield;
                        }
                        firstYield = false;
                        previousEnd = null;
                    }

                    previousEnd = endExclusive == 0 ? null : (new string(charBuffer, 0, endExclusive) + previousEnd);

                    // If we didn't decode the start of the array, put it at the end for next time
                    if (leftOverData != 0)
                    {
                        Buffer.BlockCopy(buffer, 0, buffer, bufferSize, leftOverData);
                    }
                }
                if (leftOverData != 0)
                {
                    // At the start of the final buffer, we had the end of another character.
                    throw new InvalidDataException("Invalid UTF-8 data at start of stream");
                }
                if (firstYield && string.IsNullOrEmpty(previousEnd))
                {
                    yield break;
                }
                yield return previousEnd ?? "";
            }
            finally
            {
                stream.Dispose();
            }
        }

        IEnumerator IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }
    }
}


// StreamUtil.cs:
public static class StreamUtil
{
    public static void ReadExactly(Stream input, byte[] buffer, int bytesToRead)
    {
        int index = 0;
        while (index < bytesToRead)
        {
            int read = input.Read(buffer, index, bytesToRead - index);
            if (read == 0)
            {
                throw new EndOfStreamException
                    (String.Format("End of stream reached with {0} byte{1} left to read.",
                                   bytesToRead - index,
                                   bytesToRead - index == 1 ? "" : "s"));
            }
            index += read;
        }
    }
}

Feedback very welcome. This was fun :)

Jon Skeet
  • Righto - I'm planning to start in about an hour. I should be able to support single-byte encodings, Encoding.Unicode, and Encoding.UTF8. Other double-byte encodings won't be supported. I'm expecting testing to be a pain :( – Jon Skeet Jan 17 '09 at 18:04
  • @Jon: would my code do....http://stackoverflow.com/questions/2241012/net-is-there-a-way-to-read-a-txt-file-from-bottom-to-top/2241173#2241173 – t0mm13b Feb 10 '10 at 23:43
  • +1, but feature request: remove BOM - if last (i.e. first) character is 0xFEFF, ignore it. This version adds ? character into last line beginning. – peenut Jun 13 '11 at 17:10
  • @peenut: I think I'd probably deal with that in an iterator wrapped round this one. Try to separate the concerns that way. – Jon Skeet Jun 13 '11 at 17:35
  • Don't mention the edit I made, I was just reading this answer and I came over the non-highlighted `bool`, which I wanted to be highlighted as well. Using the language tag we can specify: http://meta.stackexchange.com/questions/63800/interface-options-for-specifying-language-prettify/81970#81970 But after specifying the language is C#, it still didn't get colored. Roll back if you want to. :) – Martijn Courteaux Sep 24 '11 at 06:22
  • 6
    Wow! I know it's more than three years old, but this piece of code rocks! Thanks!! (P.S. I just replaced File.OpenRead(filename) with File.Open(filename, FileMode.Open, FileAccess.Read, FileShare.ReadWrite) to let the iterator read files that are already open.) – Stefano Aug 24 '12 at 14:57
  • Lovely (even after 4 years), but I do wonder: what is the rationale behind "sealed"? It's rearing its ugly tail in the whole .NET Framework, and now I see even Jon Skeet is actively applying it (or at least has been 4 years ago). Is it really to say: "Yes, I made some working implementation. No, I'm not going to allow you to inherit, because then I would have to add proper design"? – Grimace of Despair Jul 25 '13 at 10:17
  • 1
    @GrimaceofDespair: More "because then I would have had to design for inheritance, which adds a very significant cost in terms of both design time and future flexibility". Often it's not even clear how inheritance could sensibly be used for a type - better to prohibit it until that clarity has been found, IMO. – Jon Skeet Jul 25 '13 at 10:57
  • Still, it's very possible to f*ck up a wonderful design very easily. So seldom I see the added value of shielding off code like that. IMO, the cost of having to work around sealed and internal just in order to tweak something that should have worked in the first place, is often such a burden, that I'd rather see improper design that *is* extensible. But then again, that's for another lively SO discussion... – Grimace of Despair Jul 25 '13 at 11:04
  • @JonSkeet Your code throws an error saying that the file is in use by another process. It's normal for code like this to be used to read log files that another service may still be writing to, so that case needs to be handled as well. Thanks – rahularyansharma Oct 20 '13 at 18:09
  • 3
    @rahularyansharma: Whereas I like to split problems into orthogonal aspects. Once you've worked out how to open the file in your case, I'd expect my code to just work for you. – Jon Skeet Oct 21 '13 at 05:44
  • @JonSkeet it looks like this class only disposes the stream when it has finished enumerating. So if I break the foreach early, won't the stream not get disposed? In my case, i just want to read the last X lines of the file, then quit. Does this need to be modified to handle closing the stream early? – DLeh Mar 11 '15 at 17:49
  • @DLeh: It disposes in the `finally` block - which will also get executed if the enumerator is disposed, which will happen if you quit from a foreach loop. – Jon Skeet Mar 11 '15 at 17:51
  • @JonSkeet okay thanks. I wasn't 100% on how enumerators work with `finally`s. Thanks! – DLeh Mar 11 '15 at 17:52
  • Is there a "master" copy of this somewhere that's kept up-to-date with bug fixes? – Matt Houser Apr 23 '15 at 05:51
  • @MattHouser: I have a private source control repo which I'd like to make open at some point, but with no time to do so at the moment. http://jonskeet.uk/csharp/miscutil is the current "home page" of the library, but as you can see it hasn't been updated since 2009... – Jon Skeet Apr 23 '15 at 05:54
  • Which is more up-to-date, this version or jonskeet.uk? I took this code and ran it through a UTF8 file and when I hit the start of the file, it included the last byte of the 3-byte UTF8 "cookie" as the first byte of the first line. – Matt Houser Apr 23 '15 at 06:01
  • 1
    @MattHouser: As far as I'm aware, the two are the same. This is the BOM (byte order mark) character (not byte), and yes, I don't explicitly try to remove it at the moment. It's never clear to me exactly when IO routines *should* remove it - it's present in the file as a character, after all. I suggest that right now, you either ensure that your files don't start with the BOM or replace it yourself (`line = line.Replace("\ufeff", "")`) – Jon Skeet Apr 23 '15 at 06:05
  • That worked. Thanks. It works great. Please get this code into GitHub (or equivalent) and NuGet :) Googling for 'ReverseLineReader' finds some copies littered around. – Matt Houser Apr 23 '15 at 06:15
  • @MattHouser: It's already in NuGet, albeit in a prerelease form as I'd probably want to change namespaces: https://www.nuget.org/packages/JonSkeet.MiscUtil/ – Jon Skeet Apr 23 '15 at 06:16
  • 1
    If anyone wants to be able to share the file with another process, such as when you want to read a log file that's open for writing by the parent, just replace: File.OpenRead(filename) With: new FileStream(filename, FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite) – mostlydev May 27 '15 at 19:55
  • I can't seem to dispose of the underlying stream except if I loop entirely or I break the loop. I tried doing a `GetEnumerator` then call `Dispose` on it, but afterward the file is still in use. – user276648 Mar 28 '17 at 09:30
  • @user276648: It should be fine... are you entering the loop *at all*? If you don't call `MoveNext()` even once, I can see how that *could* be an issue... – Jon Skeet Mar 28 '17 at 10:13
  • You're right sorry. I tried again and it works: `GetEnumerator` then `MoveNext` then `Dispose`. The file isn't used anymore. If you never call `MoveNext` though, even calling `Dispose` won't dispose of the underlying stream as you mentioned, but that shouldn't really happen. – user276648 Mar 29 '17 at 02:43
  • @user276648: It still counts as a bug IMO, but I'm not in a position to fix it right now :( – Jon Skeet Mar 29 '17 at 05:43
  • No problem, anyway it's quite simple. For anyone interested, you create a child class implementing `IEnumerator` that uses the `GetEnumeratorImpl` (to which you'll need to pass the `bufferSize`, `encoding`, `characterStartDetector`), and in the `Dispose` method you dispose the stream. – user276648 Mar 29 '17 at 09:21
  • @JonSkeet UTF-16 can either use 2 byte or 4 byte per character, but your `characterStartDetector` considers all 2 byte boundaries as a start of a character. Wouldn't it be incorrect when reading the 4-byte characters? – Vikhram Aug 16 '17 at 17:09
  • @Vikhram: Quite possibly. I don't have the time to fix it right now, but I'll add a TODO. – Jon Skeet Aug 16 '17 at 17:25
  • @JonSkeet Thanks Jon – Vikhram Aug 16 '17 at 17:26
  • Wonderful code, but it closes the carefully managed stream I provide to it, even though I'm not using a using statement. It should implement IDisposable so that I have at least some control. Use case: I process an enormous data file which contains dates. I read the first line and the last line so I can show a progress bar. After reading the last line, the stream is closed and I have to reopen it. :'( – JDC Jan 22 '19 at 12:15
  • @JDC: Yes, in most cases I believe that's what's most useful. But obviously you can change the code to suit your own use case. Note that any time you use `foreach` you implicitly *do* use a `using` statement, so I don't think it's unreasonable to close the stream. That's why the constructor accepts a `Func` rather than a `Stream`. You might want to consider using the code as-is, but having a wrapper `Stream` that delegates calls *other* than `Close`/`Dispose` to an underlying stream. – Jon Skeet Jan 22 '19 at 12:22
  • Here's an example of how to use it to iterate over lines in a log file: https://github.com/projectkudu/kudu/blob/fdffce21d012d7691224bd3814d114f958f7e54f/Kudu.Services/Diagnostics/ApplicationLogsReader.cs#L237 – Randy Burden Mar 29 '21 at 18:56
  • I tried this with `foreach` and had no problem. However, using `GetEnumerator()` does not work properly. `Current` is always null. – D.Go Jun 07 '21 at 21:31
  • @D.Go: That suggests you didn't call `MoveNext()`. – Jon Skeet Jun 07 '21 at 21:32
  • @JonSkeet I have your book, and you rock. Any tips on where one might learn more about how to handle the cases your class handles - such as the various different encodings? – SpiritBob Jun 10 '21 at 07:20
  • 1
    @SpiritBob: I'm afraid that's too broad a question to answer, really. I'd start by trying to understand all the encodings you're interested in, and what problem you're trying to solve. – Jon Skeet Jun 10 '21 at 07:37
  • @JonSkeet Thanks. I was using `GetEnumerator()` incorrectly. I called it multiple times. – D.Go Jun 13 '21 at 02:27
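
The BOM handling discussed in the comments above (dealing with it "in an iterator wrapped round this one") might be sketched as follows. `WithoutLeadingBom` is a hypothetical helper, not part of MiscUtil. It assumes the lines arrive in reverse order, so the first line of the file is the last item yielded.

```csharp
using System;
using System.Collections.Generic;

public static class BomStripping
{
    // Hypothetical wrapper: passes reversed lines through lazily, holding back one line
    // so that U+FEFF can be removed from the final item (the first line of the file).
    public static IEnumerable<string> WithoutLeadingBom(IEnumerable<string> reversedLines)
    {
        string pending = null;
        bool hasPending = false;

        foreach (string line in reversedLines)
        {
            if (hasPending)
            {
                yield return pending;
            }
            pending = line;
            hasPending = true;
        }

        if (hasPending)
        {
            bool startsWithBom = pending != null && pending.Length > 0 && pending[0] == '\uFEFF';
            yield return startsWithBom ? pending.Substring(1) : pending;
        }
    }

    public static void Main()
    {
        // Stand-in for the output of a reverse reader over a file that starts with a BOM.
        var reversed = new List<string> { "third", "second", "\uFEFFfirst" };
        Console.WriteLine(string.Join(",", WithoutLeadingBom(reversed))); // prints "third,second,first"
    }
}
```

Because it only ever buffers one line, the wrapper keeps the memory behaviour of the underlying reader.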
8

Attention: this approach doesn't work (explained in EDIT)

You could use File.ReadLines to get a line iterator:

foreach (var line in File.ReadLines(@"C:\temp\ReverseRead.txt").Reverse())
{
    if (noNeedToReadFurther)
        break;

    // process line here
    Console.WriteLine(line);
}

EDIT:

After reading applejacks01's comment, I ran some tests, and it does look like .Reverse() actually loads the whole file.

I used File.ReadLines() to print the first line of a 40MB file: memory usage of the console app was 5MB. Then I used File.ReadLines().Reverse() to print the last line of the same file: memory usage was 95MB.

Conclusion

Whatever `Reverse()` is doing, it is not a good choice for reading the bottom of a big file.
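
A small experiment along the same lines, without touching the file system: the sketch below counts how many items a lazy sequence has to produce before the first result comes out. The names are invented for the illustration. The point is only that `Reverse()` cannot return anything until the whole source has been consumed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReverseBufferingDemo
{
    // Counts how many items the lazy sequence has actually produced.
    public static int ItemsProduced;

    public static IEnumerable<int> Numbers(int count)
    {
        for (int i = 0; i < count; i++)
        {
            ItemsProduced++;
            yield return i;
        }
    }

    public static void Main()
    {
        ItemsProduced = 0;
        int first = Numbers(1000).First();
        Console.WriteLine("First(): value {0}, items produced {1}", first, ItemsProduced);
        // prints "First(): value 0, items produced 1"

        ItemsProduced = 0;
        int last = Numbers(1000).Reverse().First();
        Console.WriteLine("Reverse().First(): value {0}, items produced {1}", last, ItemsProduced);
        // prints "Reverse().First(): value 999, items produced 1000"
    }
}
```

With `File.ReadLines` as the source, every one of those items is a line of the file, which matches the memory jump measured above.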

Community
Roman Gudkov
  • 3
    I wonder if the call to Reverse DOES actually load the whole file into memory. Wouldn't the ending point of the Enumerable need to be established first? I.e internally, the enumerable fully enumerates the file to create a temp array, which is then Reversed, which then is enumerated one by one using the yield keyword such that a new Enumerable is created iterating in the reverse order – applejacks01 Jul 26 '16 at 12:48
  • 3
    Original answer was wrong, but I keep EDITED answer here as it may prevent other people from using this approach. – Roman Gudkov May 19 '17 at 10:29
8

Very fast solution for huge files: From C#, use PowerShell's Get-Content with the Tail parameter.

using System.Linq; // needed for FirstOrDefault()
using System.Management.Automation;

using (PowerShell powerShell = PowerShell.Create())
{
    string lastLine = powerShell.AddCommand("Get-Content")
        .AddParameter("Path", @"c:\a.txt")
        .AddParameter("Tail", 1)
        .Invoke().FirstOrDefault()?.ToString();
}

Required reference: 'System.Management.Automation.dll' - may be somewhere like 'C:\Program Files (x86)\Reference Assemblies\Microsoft\WindowsPowerShell\3.0'

Using PowerShell incurs a small overhead but is worth it for huge files.

Andrew D. Bond
Didar_Uranov
3

To create a file iterator you can do this:

EDIT:

This is my fixed version of a fixed-width reverse file reader:

public static IEnumerable<string> readFile()
{
    using (FileStream reader = new FileStream(@"c:\test.txt",FileMode.Open,FileAccess.Read))
    {
        int i=0;
        StringBuilder lineBuffer = new StringBuilder();
        int byteRead;
        while (-i < reader.Length)
        {
            reader.Seek(--i, SeekOrigin.End);
            byteRead = reader.ReadByte();
            if (byteRead == 10 && lineBuffer.Length > 0)
            {
                yield return Reverse(lineBuffer.ToString());
                lineBuffer.Remove(0, lineBuffer.Length);
            }
            lineBuffer.Append((char)byteRead);
        }
        yield return Reverse(lineBuffer.ToString());
        reader.Close();
    }
}

public static string Reverse(string str)
{
    char[] arr = new char[str.Length];
    for (int i = 0; i < str.Length; i++)
        arr[i] = str[str.Length - 1 - i];
    return new string(arr);
}
Igor Zelaya
  • That's now close to being correct for ISO-8859-1, but not for any other encoding. Encodings make this really tricky :( – Jon Skeet Jan 17 '09 at 14:47
  • What do you mean by "close to being correct for ISO-8859-1"? What is still missing? – Igor Zelaya Jan 17 '09 at 17:02
  • The handling isn't quite right to match "\r" "\n" and "\r\n" where the latter ends up only counting as a single line break. – Jon Skeet Jan 17 '09 at 18:05
  • 1
    It also never yields empty lines - "a\n\nb" should yield "a", "", "b" – Jon Skeet Jan 17 '09 at 18:07
  • mmmmmm...I am yielding the lineBuffer only when I find a '\n'(ASCII 10). You are right, I am not taking into accoun '\r'. – Igor Zelaya Jan 18 '09 at 17:43
  • mmmmmmm. I am also not so sure about yielding empty lines. Is that the default behavior of the StreamReader classs when calling ReadLine() Method? – Igor Zelaya Jan 18 '09 at 17:45
3

I'll add my solution as well. After reading some of the answers, nothing really fit my case. I read byte by byte from the end until I find a line feed, then return the collected bytes as a string, without using any buffering.

Usage:

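// The constructor takes a stream and an encoding; UTF-8 is just an example here
var reader = new ReverseTextReader(File.OpenRead(path), Encoding.UTF8);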
while (!reader.EndOfStream)
{
    Console.WriteLine(reader.ReadLine());  
}

Implementation:

public class ReverseTextReader
{
    private const int LineFeedLf = 10;
    private const int LineFeedCr = 13;
    private readonly Stream _stream;
    private readonly Encoding _encoding;

    public bool EndOfStream => _stream.Position == 0;

    public ReverseTextReader(Stream stream, Encoding encoding)
    {
        _stream = stream;
        _encoding = encoding;
        _stream.Position = _stream.Length;
    }

    public string ReadLine()
    {
        if (_stream.Position == 0) return null;

        var line = new List<byte>();
        var endOfLine = false;
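        while (!endOfLine)
        {
            var b = _stream.ReadByteFromBehind();

            if (b == -1 || b == LineFeedLf)
            {
                // Start of stream or line feed: the line is complete.
                // Neither is stored; Convert.ToByte(-1) would throw an OverflowException.
                endOfLine = true;
            }
            else
            {
                line.Add((byte)b);
            }
        }

        line.Reverse();
        // Drop the CR of a CRLF pair, which now sits at the end of the line
        if (line.Count > 0 && line[line.Count - 1] == LineFeedCr)
        {
            line.RemoveAt(line.Count - 1);
        }
        return _encoding.GetString(line.ToArray());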
    }
}

public static class StreamExtensions
{
    public static int ReadByteFromBehind(this Stream stream)
    {
        if (stream.Position == 0) return -1;

        stream.Position = stream.Position - 1;
        var value = stream.ReadByte();
        stream.Position = stream.Position - 1;
        return value;
    }
}
David
2

I put the file into a list line by line, then used List.Reverse();

        StreamReader objReader = new StreamReader(filename);
        string sLine = "";
        ArrayList arrText = new ArrayList();

        while (sLine != null)
        {
            sLine = objReader.ReadLine();
            if (sLine != null)
                arrText.Add(sLine);
        }
        objReader.Close();


        arrText.Reverse();

        foreach (string sOutput in arrText)
        {

...

chris
  • 5
    Not the best solution for big files since you need to load it entirely into RAM. And the OP explicitly specified that he doesn't want to load it completely. – CodesInChaos Dec 26 '10 at 14:49
1

There are good answers here already, and here's another LINQ-compatible class you can use which focuses on performance and support for large files. It assumes a "\r\n" line terminator.

Usage:

var reader = new ReverseTextReader(@"C:\Temp\ReverseTest.txt");
while (!reader.EndOfStream)
    Console.WriteLine(reader.ReadLine());

ReverseTextReader Class:

/// <summary>
/// Reads a text file backwards, line-by-line.
/// </summary>
/// <remarks>This class uses file seeking to read a text file of any size in reverse order.  This
/// is useful for needs such as reading a log file newest-entries first.</remarks>
public sealed class ReverseTextReader : IEnumerable<string>
{
    private const int BufferSize = 16384;   // The number of bytes read from the underlying stream.
    private readonly Stream _stream;        // Stores the stream feeding data into this reader
    private readonly Encoding _encoding;    // Stores the encoding used to process the file
    private byte[] _leftoverBuffer;         // Stores the leftover partial line after processing a buffer
    private readonly Queue<string> _lines;  // Stores the lines parsed from the buffer

    #region Constructors

    /// <summary>
    /// Creates a reader for the specified file.
    /// </summary>
    /// <param name="filePath"></param>
    public ReverseTextReader(string filePath)
        : this(new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read), Encoding.Default)
    { }

    /// <summary>
    /// Creates a reader using the specified stream.
    /// </summary>
    /// <param name="stream"></param>
    public ReverseTextReader(Stream stream)
        : this(stream, Encoding.Default)
    { }

    /// <summary>
    /// Creates a reader using the specified path and encoding.
    /// </summary>
    /// <param name="filePath"></param>
    /// <param name="encoding"></param>
    public ReverseTextReader(string filePath, Encoding encoding)
        : this(new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read), encoding)
    { }

    /// <summary>
    /// Creates a reader using the specified stream and encoding.
    /// </summary>
    /// <param name="stream"></param>
    /// <param name="encoding"></param>
    public ReverseTextReader(Stream stream, Encoding encoding)
    {          
        _stream = stream;
        _encoding = encoding;
        _lines = new Queue<string>(128);            
        // The stream needs to support seeking for this to work
        if(!_stream.CanSeek)
            throw new InvalidOperationException("The specified stream needs to support seeking to be read backwards.");
        if (!_stream.CanRead)
            throw new InvalidOperationException("The specified stream needs to support reading to be read backwards.");
        // Set the current position to the end of the file
        _stream.Position = _stream.Length;
        _leftoverBuffer = new byte[0];
    }

    #endregion

    #region Overrides

    /// <summary>
    /// Reads the previous line from the underlying stream (the next line in reverse order).
    /// </summary>
    /// <returns></returns>
    public string ReadLine()
    {
        // Are there lines left to read? If so, return the next one
        if (_lines.Count != 0) return _lines.Dequeue();
        // Are we at the beginning of the stream? If so, we're done
        if (_stream.Position == 0) return null;

        #region Read and Process the Next Chunk

        // Remember the current position
        var currentPosition = _stream.Position;
        var newPosition = currentPosition - BufferSize;
        // Are we before the beginning of the stream?
        if (newPosition < 0) newPosition = 0;
        // Calculate the buffer size to read
        var count = (int)(currentPosition - newPosition);
        // Set the new position
        _stream.Position = newPosition;
        // Make a new buffer but append the previous leftovers
        var buffer = new byte[count + _leftoverBuffer.Length];
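        // Read the next buffer (Stream.Read may return fewer bytes than requested, so loop)
        var totalRead = 0;
        while (totalRead < count)
        {
            var read = _stream.Read(buffer, totalRead, count - totalRead);
            if (read == 0) break;
            totalRead += read;
        }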
        // Move the position of the stream back
        _stream.Position = newPosition;
        // And copy in the leftovers from the last buffer
        if (_leftoverBuffer.Length != 0)
            Array.Copy(_leftoverBuffer, 0, buffer, count, _leftoverBuffer.Length);
        // Look for CrLf delimiters
        var end = buffer.Length - 1;
        var start = buffer.Length - 2;
        // Search backwards for a line feed
        while (start >= 0)
        {
            // Is it a line feed?
            if (buffer[start] == 10)
            {
                // Yes.  Extract a line and queue it (but exclude the \r\n)
                _lines.Enqueue(_encoding.GetString(buffer, start + 1, end - start - 2));
                // And reset the end
                end = start;
            }
            // Move to the previous character
            start--;
        }
        // What's left over is a portion of a line. Save it for later.
        _leftoverBuffer = new byte[end + 1];
        Array.Copy(buffer, 0, _leftoverBuffer, 0, end + 1);
        // Are we at the beginning of the stream?
        if (_stream.Position == 0)
            // Yes.  Add the last line.
            _lines.Enqueue(_encoding.GetString(_leftoverBuffer, 0, end - 1));

        #endregion

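        // If we have something in the queue, return it. An empty queue here means the chunk
        // held no complete line (a line longer than BufferSize), so keep reading backwards.
        return _lines.Count == 0 ? ReadLine() : _lines.Dequeue();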
    }

    #endregion

    #region IEnumerable<string> Interface

    public IEnumerator<string> GetEnumerator()
    {
        string line;
        // So long as the next line isn't null...
        while ((line = ReadLine()) != null)
            // Read and return it.
            yield return line;
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }

    #endregion
}
Jon Person
  • Old article, but I had a hard time writing a backward reader. This one actually works and it's fast; the one small change I made was to implement IDisposable for safer execution. – Dima Sherba Oct 12 '21 at 12:34
1

You can read the file one character at a time backwards and cache all characters until you reach a carriage return and/or line feed.

You then reverse the collected string and yield it as a line.
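
As a rough illustration of that idea, here is a minimal sketch. It assumes a seekable stream and a single-byte encoding such as ASCII, where every byte is one character. The class name `BackwardLines` and its methods are hypothetical, not part of any library. Seeking before every byte is simple but slow, so treat it as a demonstration rather than something tuned for large files.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class BackwardLines
{
    // Hypothetical helper: yields the lines of a seekable stream from last to first.
    // Assumes a single-byte encoding (e.g. ASCII), so one byte is one character.
    public static IEnumerable<string> Read(Stream stream, Encoding encoding)
    {
        var reversed = new List<byte>();   // bytes of the current line, last byte first
        long length = stream.Length;
        long position = length;

        while (position > 0)
        {
            // Step back one byte and read it
            stream.Position = --position;
            int b = stream.ReadByte();

            if (b == '\n')
            {
                // A line feed that ends the file does not start an extra empty line
                if (position != length - 1)
                    yield return Decode(reversed, encoding);
                reversed.Clear();
            }
            else
            {
                reversed.Add((byte)b);
            }
        }

        // Whatever remains is the first line of the file
        if (length > 0)
            yield return Decode(reversed, encoding);
    }

    private static string Decode(List<byte> reversed, Encoding encoding)
    {
        // Drop the '\r' of a "\r\n" pair; it is the first byte collected for the line
        int skip = (reversed.Count > 0 && reversed[0] == (byte)'\r') ? 1 : 0;
        var bytes = new byte[reversed.Count - skip];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = reversed[reversed.Count - 1 - i];
        return encoding.GetString(bytes);
    }
}
```

It could be used as `foreach (var line in BackwardLines.Read(File.OpenRead(path), Encoding.ASCII)) { ... }`, where `path` stands for your own file.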

idstam
  • 4
    Reading a file one character at a time backwards is hard though - because you've got to be able to recognise the start of a character. How simple that is will depend on the encoding. – Jon Skeet Jan 17 '09 at 08:15
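
To make the encoding point concrete for UTF-8: continuation bytes always have the bit pattern `10xxxxxx`, so you can step backwards until you reach a byte that is not a continuation byte. The sketch below assumes well-formed UTF-8 and a seekable stream; `Utf8Boundary` and its members are hypothetical names used only for illustration. Other variable-width encodings need different rules.

```csharp
using System;
using System.IO;

public static class Utf8Boundary
{
    // In UTF-8, continuation bytes always look like 10xxxxxx
    public static bool IsContinuationByte(byte b)
    {
        return (b & 0xC0) == 0x80;
    }

    // Hypothetical helper: starting at 'position', moves backwards to the first
    // byte of the character that contains that position. Assumes well-formed UTF-8.
    public static long FindCharacterStart(Stream stream, long position)
    {
        while (position > 0)
        {
            stream.Position = position;
            int b = stream.ReadByte();
            // Stop at end of stream or at a byte that begins a character
            if (b < 0 || !IsContinuationByte((byte)b))
                break;
            position--;
        }
        return position;
    }
}
```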
1

I know this post is very old, but since I couldn't work out how to use the most-voted solution, I kept looking and finally found this. It is the best answer I found with a low memory cost, in both VB and C#:

http://www.blakepell.com/2010-11-29-backward-file-reader-vb-csharp-source

I hope this helps others, because it took me hours to finally find this post!

[Edit]

Here is the C# code:

//*********************************************************************************************************************************
//
//             Class:  BackwardReader
//      Initial Date:  11/29/2010
//     Last Modified:  11/29/2010
//     Programmer(s):  Original C# Source - the_real_herminator
//                     http://social.msdn.microsoft.com/forums/en-US/csharpgeneral/thread/9acdde1a-03cd-4018-9f87-6e201d8f5d09
//                     VB Conversion - Blake Pell
//
//*********************************************************************************************************************************

using System.Text;
using System.IO;
public class BackwardReader
{
    private string path;
    private FileStream fs = null;
    public BackwardReader(string path)
    {
        this.path = path;
        fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
        fs.Seek(0, SeekOrigin.End);
    }
    public string Readline()
    {
        byte[] line;
        byte[] text = new byte[1];
        long position = 0;
        int count;
        fs.Seek(0, SeekOrigin.Current);
        position = fs.Position;
        //do we have trailing \r\n?
        if (fs.Length > 1)
        {
            byte[] vagnretur = new byte[2];
            fs.Seek(-2, SeekOrigin.Current);
            fs.Read(vagnretur, 0, 2);
            if (ASCIIEncoding.ASCII.GetString(vagnretur).Equals("\r\n"))
            {
                //move it back
                fs.Seek(-2, SeekOrigin.Current);
                position = fs.Position;
            }
        }
        while (fs.Position > 0)
        {
            text.Initialize();
            //read one char
            fs.Read(text, 0, 1);
            string asciiText = ASCIIEncoding.ASCII.GetString(text);
            //move back to the character before
            fs.Seek(-2, SeekOrigin.Current);
            if (asciiText.Equals("\n"))
            {
                fs.Read(text, 0, 1);
                asciiText = ASCIIEncoding.ASCII.GetString(text);
                if (asciiText.Equals("\r"))
                {
                    fs.Seek(1, SeekOrigin.Current);
                    break;
                }
            }
        }
        count = (int)(position - fs.Position);
        line = new byte[count];
        fs.Read(line, 0, count);
        fs.Seek(-count, SeekOrigin.Current);
        return ASCIIEncoding.ASCII.GetString(line);
    }
    public bool SOF
    {
        get
        {
            return fs.Position == 0;
        }
    }
    public void Close()
    {
        fs.Close();
    }
}
JC Frigon
  • You should include the relevant parts from the link in your answer and add the link for reference only, so that your answer still adds value even if the link changes. – Thomas Flinkow Apr 09 '18 at 19:56
  • If you have private `IDisposable` fields, you should implement `IDisposable` too, and properly dispose of these fields. – thomasb Dec 18 '19 at 11:20
  • To make this code work, the "n" and "r" string literals should be replaced with "\n" and "\r". Unfortunately, although the code works after that fix, it is very slow even for small files; see Jon Person's solution. – Dima Sherba Oct 12 '21 at 12:29
0

I wanted to do something similar. Here is my code. This class creates temporary files, each containing a chunk of the big file, which avoids holding the whole file in memory. The caller can specify whether the file should be reversed, in which case the content is returned in reverse line order.

This class can also be used to write large amounts of data to a single file without holding it all in memory.

Please provide feedback.

        using System;
        using System.Collections.Generic;
        using System.Diagnostics;
        using System.IO;
        using System.Linq;
        using System.Text;
        using System.Threading.Tasks;

        namespace BigFileService
        {    
            public class BigFileDumper
            {
                /// <summary>
                /// Buffer that will store the lines until it is full.
                /// Then it will dump it to temp files.
                /// </summary>
                public int CHUNK_SIZE = 1000;
                public bool ReverseIt { get; set; }
                public long TotalLineCount { get { return totalLineCount; } }
                private long totalLineCount;
                private int BufferCount = 0;
                private StreamWriter Writer;
                /// <summary>
                /// List of files that would store the chunks.
                /// </summary>
                private List<string> LstTempFiles;
                private string ParentDirectory;
                private char[] trimchars = { '/', '\\'};


                public BigFileDumper(string FolderPathToWrite)
                {
                    this.LstTempFiles = new List<string>();
                    this.ParentDirectory = FolderPathToWrite.TrimEnd(trimchars) + "\\" + "BIG_FILE_DUMP";
                    this.totalLineCount = 0;
                    this.BufferCount = 0;
                    this.Initialize();
                }

                private void Initialize()
                {
                    // Delete existing directory.
                    if (Directory.Exists(this.ParentDirectory))
                    {
                        Directory.Delete(this.ParentDirectory, true);
                    }

                    // Create a new directory.
                    Directory.CreateDirectory(this.ParentDirectory);
                }

                public void WriteLine(string line)
                {
                    if (this.BufferCount == 0)
                    {
                        string newFile = "DumpFile_" + LstTempFiles.Count();
                        LstTempFiles.Add(newFile);
                        Writer = new StreamWriter(this.ParentDirectory + "\\" + newFile);
                    }
                    // Keep on adding in the buffer as long as size is okay.
                    if (this.BufferCount < this.CHUNK_SIZE)
                    {
                        this.totalLineCount++; // main count
                        this.BufferCount++; // Chunk count.
                        Writer.WriteLine(line);
                    }
                    else
                    {
                        // Buffer is full, time to create a new file.
                        // Close the existing file first.
                        Writer.Close();
                        // Make buffer count 0 again.
                        this.BufferCount = 0;
                        this.WriteLine(line);
                    }
                }

                public void Close()
                {
                    if (Writer != null)
                        Writer.Close();
                }

                public string GetFullFile()
                {
                    if (LstTempFiles.Count <= 0)
                    {
                        Debug.Assert(false, "There are no files created.");
                        return "";
                    }
                    string returnFilename = this.ParentDirectory + "\\" + "FullFile";
                    if (File.Exists(returnFilename) == false)
                    {
                        // Create a consolidated file from the existing small dump files.
                        // We open the small dump files one by one.
                        // If the user wants a reversed file, we read the chunks in descending order and reverse
                        // the lines within each chunk; otherwise we read them in ascending order as they are.

                        if (this.ReverseIt)
                            this.LstTempFiles.Reverse();

                        foreach (var fileName in LstTempFiles)
                        {
                            string fullFileName = this.ParentDirectory + "\\" + fileName;
                            // fileLines uses only a small amount of memory, depending on CHUNK_SIZE. The user has control.
                            var fileLines = File.ReadAllLines(fullFileName);

                            // Time to write in the writer.
                            if (this.ReverseIt)
                                fileLines = fileLines.Reverse().ToArray();

                            // Write the lines 
                            File.AppendAllLines(returnFilename, fileLines);
                        }
                    }

                    return returnFilename;
                }
            }
        }

This service can be used as follows:

        void TestBigFileDump_File(string BIG_FILE, string FOLDER_PATH_FOR_CHUNK_FILES)
        {
            // Start processing the input Big file.
            StreamReader reader = new StreamReader(BIG_FILE);
            // Create a dump file class object to handle efficient memory management.
            var bigFileDumper = new BigFileDumper(FOLDER_PATH_FOR_CHUNK_FILES);
            // Set to reverse the output file.
            bigFileDumper.ReverseIt = true;
            bigFileDumper.CHUNK_SIZE = 100; // How much at a time to keep in RAM before dumping to local file.

            while (reader.EndOfStream == false)
            {
                string line = reader.ReadLine();
                bigFileDumper.WriteLine(line);
            }
            bigFileDumper.Close();
            reader.Close();

            // Get back full reversed file.
            var reversedFilename = bigFileDumper.GetFullFile();
            Console.WriteLine("Check output file - " + reversedFilename);
        }
ashish
0

In case anyone else comes across this, I solved it with the following PowerShell script which can easily be modified into a C# script with a small amount of effort.

[System.IO.FileStream]$fileStream = [System.IO.File]::Open("C:\Name_of_very_large_file.log", [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read, [System.IO.FileShare]::ReadWrite)
[System.IO.BufferedStream]$bs = New-Object System.IO.BufferedStream $fileStream;
[System.IO.StreamReader]$sr = New-Object System.IO.StreamReader $bs;


$buff = New-Object char[] 20;
$seek = $bs.Seek($fileStream.Length - 10000, [System.IO.SeekOrigin]::Begin);

while(($line = $sr.ReadLine()) -ne $null)
{
     $line;
}

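This seeks to a point 10,000 bytes before the end of the file and then reads forward from there, outputting each line. It gives you the tail of the file rather than the lines in reverse order, and the first line it outputs may be a partial one.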
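
For reference, a C# version of the same approach might look roughly like the sketch below. `TailReader.ReadTail` is a hypothetical helper, not a framework API. It seeks relative to the end of the stream and clamps the offset so that files shorter than the requested size do not cause a seek before the start. As with the script, lines come back in forward order and the first one may be partial.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TailReader
{
    // Hypothetical helper: returns the lines found in the last 'byteCount' bytes
    // of a seekable stream, in forward order. The stream is left open.
    public static List<string> ReadTail(Stream stream, int byteCount, Encoding encoding)
    {
        // Clamp so that short files do not cause a seek before the start
        long offset = Math.Min(byteCount, stream.Length);
        stream.Seek(-offset, SeekOrigin.End);

        var lines = new List<string>();
        using (var reader = new StreamReader(stream, encoding, false, 4096, leaveOpen: true))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                lines.Add(line);
        }
        return lines;
    }
}
```

If you need newest-first output from the tail, you can call `Reverse()` on the returned list, since only the tail is held in memory.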

user1913559
  • 1
    This will read forward from the last 10,000 bytes, not backward from the end to the start. Also, why not just `.Seek(-10000, [System.IO.SeekOrigin]::End);`? – AKX Oct 23 '18 at 10:48