0

In .net/C#, I need o parse a large file, so should not be loaded in memory entirely at once. Is there an optimized technique to read line by line and process them, while keeping in memory the last n lines read, in order to go back and forth through them? What collection would be best suitable for such operation?

mike
  • 408
  • 5
  • 18
  • There isn't a collection for this in the .NET Framework or .NET Core, you'll probably need to write your own class wrapping a list that removes the oldest element when adding a new one when the length is equal to n. But can you post a [mcve] so we can see what you tried already? – MindSwipe Nov 22 '19 at 08:21
  • 1
    Could you please show what you tried so far? It seems to me it's natural to read file using StreamReader and buffering of certain amount of lines is not that straightforward. However, you could look for some circular buffer implementation or write your own. – Stas Ivanov Nov 22 '19 at 08:27
  • You could indeed make a class inheriting from a queue, where you would simply pop a value each time you push one. but you'd also have to define a max amount in that queue. – Glenn van Acker Nov 22 '19 at 08:30
  • Interesting question, but what have you tried so far ? – Fourat Nov 22 '19 at 08:49
  • @GlennvanAcker a wrapper around Queue wouldn't work, as OP needs to get last n elements, and with a Queue you can only get (dequeue) the first element. There would need to be quite a bit of finagling around to get the last 3 elements (remove and buffer the first few elements, get the last 3, push the first elements back on in the correct order and then return the the last 3) – MindSwipe Nov 22 '19 at 09:46
  • @MindSwipe It's perfectly possible to achieve that with a getter. you could simply Queue.ToList() each time you want to look at it's values. but then it would indeed be easier to inherit from a list, and let it behave as a queue when pushing an popping. – Glenn van Acker Nov 22 '19 at 09:51
  • I would need to seek back line by line from the current line. Queue would have been fine, if it had an index. – mike Nov 22 '19 at 09:56
  • @GlennvanAcker true, I contradicted myself in my comment as well, my bad. I meant "possible, but difficult and or slow(er)". Also I dislike inheriting from List for a number of reasons, I much prefer wrapping it ([this](https://stackoverflow.com/q/21692193/9363973) Q&A explains it better than I ever could). It's a little more work to get the class, but it's cleaner and safer to use and reuse – MindSwipe Nov 22 '19 at 09:57
  • 1
    @MindSwipe I don't see the issue in inheriting from a list, or any IEnumerable, but encapsulation is also a good option. there's many approaches to solving this problem. – Glenn van Acker Nov 22 '19 at 10:05
  • 2
    @GlennvanAcker one problem I see with inheriting from List ist that List has the Method `RemoveAt(int index)`, which we do not want, because this can cause holes in the List, leading to us having to deal with them as well. Furthermore, `Add` is not virtual, we cannot overwrite `Insert(T item, int index)`, meaning a consumer of the API could insert data anywhere into the List, which we also don't want. Implementing IEnumerable isn't a problem however – MindSwipe Nov 22 '19 at 10:15
  • @MindSwipe, oh, i did not know that. Another way would be to inherit both from List and IEnumerable, and use an explicit interface implementation. That way, you can keep the internal workings hidden, but you'd need to declare it as an Ienumerable. You couldn't override it, but you could implement the Add method, and in your private method, use the base class implementation (or not) – Glenn van Acker Nov 22 '19 at 10:24

2 Answers2

2

For this you'll need a custom collection type, an array that you can continually add to, but doesn't resize and instead deletes the old entries. I came up with this after trying for a few minutes, it is extremely rough, no validation for anything and may contain logic errors for some cases, but it seems to work. (Also I'm not happy with the class name, so if you have a better idea tell me)

class SizeLimitedList<T> : IEnumerable<T>
{
    private T[] _internalArray;

    private readonly int _capacity;

    public SizeLimitedList(int capacity)
    {
        _internalArray = new T[capacity];
        _capacity = capacity;
    }

    public SizeLimitedList(IEnumerable<T> collection)
    {
        _internalArray = collection.ToArray();
        _capacity = _internalArray.Length;
    }

    public void Add(T item)
    {
        MoveArray(1);
        _internalArray[_capacity - 1] = item;
    }

    public T[] GetLastEntries(int n)
    {
        return _internalArray.Skip(_internalArray.Length - n).ToArray();
    }

    public T GetLastEntry()
    {
        return _internalArray[_internalArray.Length - 1];
    }

    private void MoveArray(int by)
    {
        Array.Copy(_internalArray, 1, _internalArray, 0, _capacity - 1);
    }

    public IEnumerator<T> GetEnumerator()
    {
        return _internalArray.AsEnumerable().GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

You can use it like so:

var list = new SizeLimitedList<string>(maxLinesKept);
var file = new StreamReader(@"C:\My\Path\To\File.txt");
while((line = file.ReadLine()) != null)
{
    list.Add(line);
    if (/* Condition that requires you to read the last n lines */)
    {
        var lines = list.GetLastEntries(nLinesToGet);
        // Do whatever with these last lines
    }
}

This is a little complicated, so let me make an example. Let's say you want to print a line to the console, but only when the previous line contains "Print Next Line:"

Print Next Line:
Hello
This line will not be printed
Print Next Line:
World

So now let's implement it:

var list = new SizeLimitedList<string>(1);
var file = new StreamReader("example.txt");

while((line = file.ReadLine()) != null)
{
    if (list.GetLastEntry() == "Print Next Line:")
        Console.WriteLine(line);

    list.Add(line);
}

This will print:

Hello
World

Into the console

P.S Feel free to leave a comment or update your original question with a sample of your file and the condition of when to read the last n lines and I can update my example to match your use-case

MindSwipe
  • 7,193
  • 24
  • 47
  • I'm still stuck with C# 7. – mike Nov 22 '19 at 09:57
  • Ok, let me update the answer to take that into account – MindSwipe Nov 22 '19 at 09:59
  • Is it better than removing the first element of a list and adding something at the end, concerning the memory allocation and fragmentation? – mike Nov 22 '19 at 10:01
  • 1
    Removing the first item of the list and adding one to the end is a little more memory consuming, as the List may run out of space and need to resize itself, which in turn may trigger the GC to clean up, which will halt the execution for a split second. Although it may sound that the List will be slower, the array I'm using is being copied everytime you add something and it's out of space, which is a cheap operation, but it still a minuscule amount takes time. Fragmentation won't be a problem, as we aren't actually removing anything, but overwriting data. And that only when the rest is full – MindSwipe Nov 22 '19 at 10:11
  • Doesn't var newArray = new T[_internalArray.Length]; lead to fragmentation? Why not using Array.Copy to move/shift in the same array? – mike Nov 22 '19 at 12:08
  • I can replace it with copying into the same array, I wasn't sure that it would work when I wrote the code initially so I just decided to go with what know works. Also it doesn't cause fragmentation, as we are simply moving every element of the array left by n and dropping the first n elements. So the order of the elements is conserved, just that we now have n empty spaces at the end. Fragmentation would mean that we suddenly have holes in our array, which we do not – MindSwipe Nov 22 '19 at 12:17
  • I was referring to fragmentation as in string s = "abc" + "ttt" + "123"; s += "more stuff"; for (i = 1; i <= 100; i++) s += i.ToString(); etc.. – mike Nov 22 '19 at 12:26
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/202889/discussion-between-mindswipe-and-mike). – MindSwipe Nov 22 '19 at 12:27
0

Well, you can use class BackwardReader which can be found here. I don't exactly know if it will be helpful because I don't how do you want to process previous lines till you get to the last N. Anyway, you can use this class to start reading backwards, save first N lines and then process other lines.

public static void ReadFile(int n, string logFile)
{
    int lineCnt = 0;
    List <string> lastNLines= new List <string>();

    BackwardReader br = new BackwardReader(logFile);
    while (!br.SOF())
    {
        string line = br.Readline();
        if (lineCnt < n) lastNLines.Add(line);
        // else your implementation for other lines
        lineCnt++;
    }
}
gjiki
  • 169
  • 1
  • 6
  • An interesting approach, could be a solution to the problem in case a combined solution to read back and forth is implemented. It is not what I have asked, though, but I'll take it into account. – mike Nov 22 '19 at 09:05