58

I am trying to read some text files, where each line needs to be processed. At the moment I am just using a StreamReader, and then reading each line individually.

I am wondering whether there is a more efficient way (in terms of LoC and readability) to do this using LINQ without compromising operational efficiency. The examples I have seen involve loading the whole file into memory, and then processing it. In this case however I don't believe that would be very efficient. In the first example the files can get up to about 50k, and in the second example, not all lines of the file need to be read (sizes are typically < 10k).

You could argue that nowadays it doesn't really matter for these small files, however I believe that sort of approach leads to inefficient code.

First example:

// Open file
using(var file = System.IO.File.OpenText(_LstFilename))
{
    // Read file
    while (!file.EndOfStream)
    {
        String line = file.ReadLine();

        // Ignore empty lines
        if (line.Length > 0)
        {
            // Create addon
            T addon = new T();
            addon.Load(line, _BaseDir);

            // Add to collection
            collection.Add(addon);
        }
    }
}

Second example:

// Open file
using (var file = System.IO.File.OpenText(datFile))
{
    // Compile regex
    Regex nameRegex = new Regex("IDENTIFY (.*)");

    while (!file.EndOfStream)
    {
        String line = file.ReadLine();

        // Check name
        Match m = nameRegex.Match(line);
        if (m.Success)
        {
            _Name = m.Groups[1].Value;

            // Remove me when other values are read
            break;
        }
    }
}
Luca Spiller
  • 50K isn't even large enough to make it into the large object heap. Streaming makes sense when your files are in the megabyte (or bigger) range, not kilobytes. – Joe Chung Aug 13 '09 at 12:38

5 Answers

96

You can write a LINQ-based line reader pretty easily using an iterator block:

static IEnumerable<SomeType> ReadFrom(string file) {
    string line;
    using(var reader = File.OpenText(file)) {
        while((line = reader.ReadLine()) != null) {
            SomeType newRecord = ParseLine(line); // parse the line into your record type
            yield return newRecord;
        }
    }
}

or to make Jon happy:

static IEnumerable<string> ReadFrom(string file) {
    string line;
    using(var reader = File.OpenText(file)) {
        while((line = reader.ReadLine()) != null) {
            yield return line;
        }
    }
}
...
var typedSequence = from line in ReadFrom(path)
                    let record = ParseLine(line)
                    where record.Active // for example
                    select record.Key;

Then you have ReadFrom(...) as a lazily evaluated sequence without buffering, perfect for Where etc.

Note that if you use OrderBy or the standard GroupBy, it will have to buffer the data in memory; if you need grouping and aggregation, "PushLINQ" has some fancy code to allow you to perform aggregations on the data and then discard it (no buffering). Jon's explanation is here.
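To make the buffering distinction concrete, here is a small illustrative sketch (assuming the string-returning ReadFrom above): Where streams one line at a time, whereas OrderBy has to pull the entire sequence into memory before it can yield anything.

// Streams: each line is read, tested and released; memory use stays flat.
var longLines = ReadFrom(path).Where(line => line.Length > 80);

// Buffers: OrderBy must consume every line before sorting, so the whole file ends up in memory.
var sorted = ReadFrom(path).OrderBy(line => line.Length);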

Marc Gravell
  • Bah, separation of concerns - separate out the line reading into a separate iterator, and use normal projection :) – Jon Skeet Aug 13 '09 at 10:48
  • Much nicer... though still file-specific ;) – Jon Skeet Aug 13 '09 at 10:54
  • I don't think that your examples will compile. "file" is already defined as a string param, so you can't make that declaration in the using block. – Justin R. Aug 07 '10 at 00:28
  • Great technique, thanks Marc! If it helps anyone I've put together a blog post regarding using this to aid reading csv in linqpad: http://www.developertipoftheday.com/2012/10/read-csv-in-linqpad.html – Alex KeySmith Oct 29 '12 at 21:43
  • @JonSkeet the link provided above by Marc is no more valid, can you please provide the new link – Mrinal Kamboj Jun 05 '15 at 08:58
  • I neither understood nor ever used "yield" until today - despite 20 years of programming C#. This is such a good and easy example! Thanks also for that ;-) – Tillito Feb 10 '17 at 15:41
  • Framework now includes [File.ReadLines()](https://learn.microsoft.com/en-us/dotnet/api/system.io.file.readlines) which does the same as the above code `static IEnumerable<string> ReadFrom(string file)`, right? – Stéphane Gourichon Mar 18 '19 at 08:47
24

It's simpler to read a line and check whether or not it's null than to check for EndOfStream all the time.

However, I also have a LineReader class in MiscUtil which makes all of this a lot simpler - basically it exposes a file (or a Func<TextReader>) as an IEnumerable<string>, which lets you do LINQ stuff over it. So you can do things like:

var query = from file in Directory.GetFiles(".", "*.log")
            from line in new LineReader(file)
            where line.Length > 0
            select new AddOn(line); // or whatever

The heart of LineReader is this implementation of IEnumerable<string>.GetEnumerator:

public IEnumerator<string> GetEnumerator()
{
    using (TextReader reader = dataSource())
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}

Almost all the rest of the source is just giving flexible ways of setting up dataSource (which is a Func<TextReader>).
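The real MiscUtil class offers several constructors and overloads; purely as a stripped-down sketch of the idea (not the actual MiscUtil source), a minimal LineReader could look like this:

using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;

public sealed class LineReader : IEnumerable<string>
{
    private readonly Func<TextReader> dataSource;

    // Convenience constructor: lazily open a file when enumeration starts.
    public LineReader(string filename)
        : this(() => File.OpenText(filename))
    {
    }

    // General form: any factory that can produce a TextReader.
    public LineReader(Func<TextReader> dataSource)
    {
        this.dataSource = dataSource;
    }

    public IEnumerator<string> GetEnumerator()
    {
        using (TextReader reader = dataSource())
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}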

Jon Skeet
  • How to close the file? And release the resource? – ca9163d9 Aug 28 '14 at 16:56
  • @dc7a9163d9: The `using` statement does that already - the `dataSource()` call will open the file, and so it will be disposed at the end of the `using` statement. – Jon Skeet Aug 28 '14 at 16:57
2

Since .NET 4.0, the File.ReadLines() method is available.

int count = File.ReadLines(filepath).Count(line => line.StartsWith(">"));
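Applied to the question's first example, the whole loop collapses into one query. A minimal sketch, assuming the same _LstFilename and collection, plus a CreateAddon helper like the one in the question's final solution:

// File.ReadLines is lazy: lines are read on demand rather than loading the whole file.
var addons = File.ReadLines(_LstFilename)
                 .Where(line => line.Length > 0)      // ignore empty lines
                 .Select(line => CreateAddon(line));

collection.AddRange(addons);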
user7610
1

NOTE: You need to watch out for the IEnumerable<T> solution, as it will result in the file being open for the duration of processing.

For example, with Marc Gravell's response:

foreach(var record in ReadFrom("myfile.csv")) {
    DoLongProcessOn(record);
}

the file will remain open for the whole of the processing.
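If you would rather close the file before the slow work starts, you can trade streaming for memory by materialising the sequence first; a small sketch of the trade-off, reusing the same hypothetical ReadFrom and DoLongProcessOn:

// ToList() reads and parses every record up front, so the file is closed as soon as it returns -
// at the cost of holding all records in memory during processing.
var records = ReadFrom("myfile.csv").ToList();

foreach (var record in records)
{
    DoLongProcessOn(record);
}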

kͩeͣmͮpͥ ͩ
  • True, but "file open for a long time, but no buffering" is often better than "lots of memory hogged for a long time" – Marc Gravell Aug 13 '09 at 10:53
  • That's true - but basically you've got three choices: load the lot in one go (fails for big files); hold the file open (as you mention); reopen the file regularly (has a number of issues). In many, many cases I believe that streaming and holding the file open is the best solution. – Jon Skeet Aug 13 '09 at 10:55
  • Yes, it's probably a better solution to keep the file open, but you just need to be aware of the implication – kͩeͣmͮpͥ ͩ Aug 13 '09 at 11:06
  • Sorry for the name typo there Marc – kͩeͣmͮpͥ ͩ Aug 13 '09 at 11:21
  • It's definitely something to keep in mind as a potentially unexpected side effect, but I also agree with Jon in that it does sound like the best solution. – Mark LeMoine Aug 17 '10 at 17:56
0

Thanks all for your answers! I decided to go with a mixture, mainly focusing on Marc's though, as I will only need to read lines from a file. I guess you could argue separation is needed everywhere, but heh, life is too short!

Regarding keeping the file open: that isn't going to be an issue in this case, as the code is part of a desktop application.

Lastly, I noticed you all used lowercase string. I know in Java there is a difference between capitalised and non-capitalised string, but I thought in C# lowercase string was just an alias for capitalised String?

public void Load(AddonCollection<T> collection)
{
    // read from file
    var query =
        from line in LineReader(_LstFilename)
        where line.Length > 0
        select CreateAddon(line);

    // add results to collection
    collection.AddRange(query);
}

protected T CreateAddon(String line)
{
    // create addon
    T addon = new T();
    addon.Load(line, _BaseDir);

    return addon;
}

protected static IEnumerable<String> LineReader(String fileName)
{
    String line;
    using (var file = System.IO.File.OpenText(fileName))
    {
        // read each line, ensuring not null (EOF)
        while ((line = file.ReadLine()) != null)
        {
            // return trimmed line
            yield return line.Trim();
        }
    }
}
Luca Spiller