Need help in understanding the explanation by Microsoft for File.ReadLines and File.ReadAllLines

Question

According to Microsoft's explanation of the ReadLines and ReadAllLines methods: "When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned. When you use ReadAllLines, you must wait for the whole array of strings be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient."

What does it actually mean when they say:

1 - "When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned." If the line of code below is written, doesn't that mean that the ReadLines method has finished executing and that the whole collection has been returned and stored in the variable filedata?

IEnumerable<String> filedata = File.ReadLines(fileWithPath);

2 - "When you use ReadAllLines, you must wait for the whole array of strings be returned before you can access the array". Does it mean that, in the code snippet below, if a large file is read, the array variable hugeFileData will not have all the data if it is used immediately after the file is read?

string[] hugeFileData = File.ReadAllLines(path);
string i = hugeFileData[hugeFileData.Length - 1];

3 - "when you are working with very large files, ReadLines can be more efficient". If that is so, is the code below efficient when reading a large file? I believe that the 2nd and 3rd lines of the code below would read the file twice; correct me if I am wrong.

string fileWithPath = "some large sized file path";
string lastLine = File.ReadLines(fileWithPath).Last();
int totalLines = File.ReadLines(fileWithPath).Count();

The reason for calling ReadLines on the same file twice in the above code snippet is that when I tried the code below, I got the exception "Cannot read from a closed TextReader" on its 3rd line.

IEnumerable<String> filedata = File.ReadLines(fileWithPath);
string lastLine = filedata.Last();
int totalLines = filedata.Count();

I suspect the difference is the use of `yield return` under the hood in `File.ReadLines`. Are you familiar with the concept? It's basically this: the method returns each piece of data as soon as it is done with it, rather than storing it in an intermediate variable, piling all its data together and then handing you the pile. — Jeroen Vannevel, Jul 23 '14 at 17:10
@JeroenVannevel From a black-box perspective he only really needs to understand *deferred execution* and the idea that `File.ReadAllLines` *buffers* the sequence whereas `File.ReadLines` *streams* the sequence (although an understanding of iterator blocks will help, I agree). — User 12345678, Jul 23 '14 at 17:11
It means that `ReadLines` uses streaming to fill in the array. Since .Net enumeration supports this kind of streaming, you can access the array through enumeration. I suspect that the `Count` method does not support this streaming. (which would make sense, how could it?) — RBarryYoung, Jul 23 '14 at 17:13
Dear all, thank you for such quick responses, and my apologies for the delayed reply. I wanted to mark the responses from Jim Mischel and Servy as answers, but I suppose SO allows only one response to be marked as the accepted answer. All the answers and comments were helpful in some way. — Piush, Jul 24 '14 at 19:24

Jim Mischel · Accepted Answer · 2014-07-23T19:19:00.360

The difference between ReadLines and ReadAllLines is easily illustrated by code.

If you write this:

foreach (var line in File.ReadLines(filename))
{
    Console.WriteLine(line);
}

What happens is similar to this:

using (var reader = new StreamReader(filename))
{
    while (!reader.EndOfStream)
    {
        var line = reader.ReadLine();
        Console.WriteLine(line);
    }
}

The actual code generated is a little more complex (ReadLines returns an enumerator whose MoveNext method reads and returns each line), but from the outside the behavior is similar.

The key to that behavior is deferred execution, which you should understand well in order to make good use of LINQ. So the answer to your first question is "No." All the call to ReadLines does is open the file and return an enumerator. It doesn't read the first line until you ask for it.

Note here that the code can output the first line before the second line is even read. In addition, you're only using memory for one line at a time.
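To make the deferred execution concrete, here is a minimal sketch of a line reader written as an iterator block. `ReadLinesSketch` is a hypothetical name, not the framework's implementation, and it differs from the description above in one respect: it does not even open the file until enumeration starts. The temp file and its contents are made up for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for File.ReadLines, written as an iterator block.
// Nothing in this body runs until the caller starts enumerating.
IEnumerable<string> ReadLinesSketch(string filePath)
{
    Console.WriteLine("[opening file]");
    using (var reader = new StreamReader(filePath))
    {
        string current;
        while ((current = reader.ReadLine()) != null)
        {
            yield return current; // hand one line to the caller, then pause here
        }
    }
    Console.WriteLine("[file closed]");
}

// Illustration data: a small temp file with three lines.
string tempPath = Path.GetTempFileName();
File.WriteAllLines(tempPath, new[] { "first", "second", "third" });

IEnumerable<string> lines = ReadLinesSketch(tempPath); // no I/O has happened yet
Console.WriteLine("before enumeration");

foreach (string line in lines)
{
    Console.WriteLine(line);
}

File.Delete(tempPath);
```

When run, "before enumeration" is printed ahead of "[opening file]", which shows that calling the method only creates the enumerable; the reading happens inside the `foreach`.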

ReadAllLines has much different behavior. When you write:

foreach (var line in File.ReadAllLines(filename))
{
    Console.WriteLine(line);
}

What actually happens is more like this:

List<string> lines = new List<string>();
using (var reader = new StreamReader(filename))
{
    while (!reader.EndOfStream)
    {
        var line = reader.ReadLine();
        lines.Add(line);
    }
}
foreach (var line in lines)
{
    Console.WriteLine(line);
}

Here, the program has to load the entire file into memory before it can output the first line.

Which one you use depends on what you want to do. If you just need to access the file line-by-line, then ReadLines is usually the better choice--especially for large files. But if you want to access lines randomly or if you'll be reading the file multiple times, then ReadAllLines might be better. However, remember that ReadAllLines requires that you have enough memory to hold the entire file.

In your third question you showed this code, which produced an exception on the last line:

IEnumerable<String> filedata = File.ReadLines(fileWithPath);
string lastLine = filedata.Last();
int totalLines = filedata.Count();

What happened here is that the first line returned an enumerator. The second line of code enumerated the entire sequence (i.e. read to the end of the file) so that it could find the last line. The enumerator saw that it was at end of file and closed the associated reader. The last line of code again tries to enumerate the file, but the file was already closed. There's no "reset to the start of the file" functionality in the enumerator returned by ReadLines.
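If you do need more than one pass, two options follow from the explanation above; this is a sketch rather than a recommendation, and the temp file is only there to make it runnable. Either call `File.ReadLines` again for each pass, or materialize the sequence once with `ToList()` and accept the memory cost.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Illustration data: a small temp file with three lines.
string tempPath = Path.GetTempFileName();
File.WriteAllLines(tempPath, new[] { "alpha", "beta", "gamma" });

// Option 1: a fresh ReadLines call per pass. The file is read twice,
// but only one line is held in memory at a time.
string lastLine = File.ReadLines(tempPath).Last();
int totalLines = File.ReadLines(tempPath).Count();

// Option 2: materialize once. The file is read a single time, but every
// line stays in memory, much like ReadAllLines.
List<string> buffered = File.ReadLines(tempPath).ToList();
string lastBuffered = buffered.Last();
int countBuffered = buffered.Count;

Console.WriteLine($"{lastLine} / {totalLines}");
Console.WriteLine($"{lastBuffered} / {countBuffered}");

File.Delete(tempPath);
```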

Before posting my query here I did understand, from other posts and forums related to ReadLines, that the enumerator that is returned can only be traversed once. So I tried the solution below: `IEnumerable<string> filedata = File.ReadLines(fileWithPath).Reverse(); string lastLine = filedata.First(); // since the file was read from the end, the first line returned by the enumerator will actually be the last line of the file. int totalLines = filedata.Count();` Assuming that the 2nd line in the above code has not moved the iterator to the end, the 3rd line should return — Piush, Jul 24 '14 at 19:42
[Continuing the above comment..] the count, but instead it throws the same exception, "Cannot read from a closed TextReader". — Piush, Jul 24 '14 at 19:43

Servy · Answer 2 · 2014-07-23T19:46:32.563

1 - No. At that point in the program, zero lines of the file need to have been read from disk and stored in memory. It's not until you ask for the first line (you have yet to ask for a single line in that snippet) that it needs to fetch the first line. It's not until you ask for the line after that that it needs to fetch the second line, and so on.
2 - That program will require the entirety of the file to be read into memory, all at once, in order to fetch the last line. If you have a 3 GB file, you need 3 GB of memory.
3 - Yes, the first snippet will read through the entire file twice, without ever needing to store more than one line in memory at any point in time. The memory footprint of that program will be O(1), rather than being dependent on the size of the file. It does require reading through the whole file from start to finish twice, so it may take longer to execute, but it'll consume vastly less memory than the snippet you showed just before it. Of course, there are ways of using ReadLines to both count the lines and fetch the last line without iterating through the sequence twice, which is what you should really do so that you can get the best of both worlds.

"Need 3 GB of continuous memory." That's not entirely true. He'll need contiguous memory to hold `lineCount` string references, but the individual lines are allocated separately. — Jim Mischel, Jul 23 '14 at 19:04
@Servy Could you please help me with one of the ways of using ReadLines to both count the lines and fetch the last line without iterating through the sequence twice. Thanks! — Piush, Jul 24 '14 at 19:55

score 1 · Answer 3 · answered Jul 23 '14 at 17:14

The ReadLines() method uses an enumerator to read each line only as needed, so code like this can work because the method is getting each line as needed:

foreach (string line in File.ReadLines("c:\\file.txt"))
{
    Console.WriteLine("-- {0}", line);
}

If the file is large, the ReadLines() method is useful because it does not need to keep all the data in memory at once. Also, if your program exits the loop early, ReadLines() is better because no further I/O will be needed.
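As a small illustration of the early-exit case, the sketch below stops at the first line that contains a search term. The temp file, its contents, and the search term "needle" are made up for the example.

```csharp
using System;
using System.IO;
using System.Linq;

// Illustration data: the match is on the second of three lines.
string tempPath = Path.GetTempFileName();
File.WriteAllLines(tempPath, new[] { "header", "needle here", "never inspected" });

// FirstOrDefault stops enumerating at the first match, so the line after
// "needle here" is never requested from the sequence.
string firstMatch = File.ReadLines(tempPath)
                        .FirstOrDefault(l => l.Contains("needle"));

Console.WriteLine(firstMatch ?? "(no match)");

File.Delete(tempPath);
```

With `File.ReadAllLines`, the whole file would have been loaded before the search could even begin.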

The ReadAllLines() method reads the entire file into memory and then returns an array of those lines.

score 0 · Answer 4 · answered Jul 23 '14 at 17:14

1 - Yes, that method call has finished executing. No, the whole collection has not been returned: nothing has been read from the file yet. The enumerable that was returned has all the necessary data and behavior to read from the file and hand you lines.
2 - When File.ReadAllLines is done, the entire file has been read. A string[] cannot lazily return results, so from the return type alone you can see that File.ReadAllLines eagerly performs all the work.
3 - Yes, you're reading the file twice. That does not have to be so. Loop over the lines returned, maintaining a counter and the last line seen. That lets you compute the two values in one pass over the file.
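The single-pass approach described in the last point could look like this minimal sketch. The temp file and the variable names are only for illustration.

```csharp
using System;
using System.IO;

// Illustration data: a small temp file with three lines.
string tempPath = Path.GetTempFileName();
File.WriteAllLines(tempPath, new[] { "one", "two", "three" });

int totalLines = 0;
string lastLine = null; // stays null if the file is empty

// One pass over the file: count every line and remember the latest one.
foreach (string line in File.ReadLines(tempPath))
{
    totalLines++;
    lastLine = line;
}

Console.WriteLine($"{totalLines} lines, last = {lastLine}");

File.Delete(tempPath);
```

For the three-line sample file this prints "3 lines, last = three", and only one line is held in memory at a time.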

score 0 · Answer 5 · answered Jul 23 '14 at 17:17

You can use ReadLines like so:

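int totalLines = 0;
foreach (string line in File.ReadLines(fileWithPath))
{
    // && short-circuits: the second check is skipped when the first fails
    if (line.Contains("bla bla") && line.Contains("do do"))
    {
        // handle a matching line here
    }
    totalLines += 1;
}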

Here you are not waiting for the whole array of strings to be returned before you access the lines. In the code below, by contrast, the entire array is loaded before the program continues:

string[] readText = File.ReadAllLines(path);
foreach (string s in readText)
{
    Console.WriteLine(s);
}
