
I want to read and process 10+ lines at a time from multi-GB files, but I haven't found a solution that keeps producing batches of 10 lines until the end of the file.

My last attempt was:

        int n = 10;
        foreach (var line in File.ReadLines("path")
            .AsParallel().WithDegreeOfParallelism(n))
        {
            System.Console.WriteLine(line);
            Thread.Sleep(1000);
        }

I've seen solutions that use buffer sizes, but I want to read whole lines rather than fixed-size chunks.

Mike
  • you're after the last 10 lines? – BugFinder Jul 13 '16 at 15:13
  • Can you not use the `.Take` function to do this? Perhaps you could look at this and get it to work for you. Also, 10 lines at a time would take you forever; why not set the lines to something like 300, for example? Check out this link: http://stackoverflow.com/questions/11326564/reading-specific-number-of-lines-from-text-file-in-c-sharp – MethodMan Jul 13 '16 at 15:16
  • Can you please clarify what you expect as result? – Alexei Levenkov Jul 13 '16 at 15:20
  • So you want to read 10 lines, process them, output the result, then read the next 10 lines, etc? And you want to process each group of 10 lines with multiple threads? – Jim Mischel Jul 13 '16 at 16:23
  • Voted to close as "unclear what you're asking," because OP is not responding to queries. – Jim Mischel Jul 13 '16 at 17:39

2 Answers


The default behaviour is to read all the lines in one shot. If you want to read fewer than that, you need to dig a little deeper into how the file is read and use a `StreamReader`, which will let you control the reading process:

        using (StreamReader sr = new StreamReader(path)) 
        {
            while (sr.Peek() >= 0) 
            {
                Console.WriteLine(sr.ReadLine());
            }
        }

`StreamReader` also has a `ReadLineAsync` method that returns a `Task<string>`.

If you collect these tasks in a `ConcurrentBag`, you can quite easily keep the processing running on 10 lines at a time:

    // requires: System.Collections.Concurrent, System.IO, System.Linq, System.Threading.Tasks
    var bag = new ConcurrentBag<Task>();
    using (StreamReader sr = new StreamReader(path))
    {
        while (sr.Peek() >= 0)
        {
            if (bag.Count < 10)
            {
                Task processing = sr.ReadLineAsync().ContinueWith((read) =>
                {
                    string s = read.Result; // EDIT: removed await to reflect Scott's comment
                    // process the line here
                });
                bag.Add(processing);
            }
            else
            {
                Task.WaitAny(bag.ToArray());
                // remove completed tasks from the bag
                bag = new ConcurrentBag<Task>(bag.Where(t => !t.IsCompleted));
            }
        }
    }

Note: this code is for guidance only and is not to be used as-is.

If all you want is the last ten lines, you can get them with the solution here: How to read a text file reversely with iterator in C#.
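
If reading the file forwards is acceptable, here is a minimal alternative sketch that streams the file once and keeps only the most recent ten lines in a bounded queue (this assumes `path` points at your file, plus the usual `System.Collections.Generic` and `System.IO` usings):

    var lastTen = new Queue<string>();
    foreach (var line in File.ReadLines(path))
    {
        lastTen.Enqueue(line);
        if (lastTen.Count > 10)
            lastTen.Dequeue(); // drop the oldest line so at most 10 are kept
    }
    // lastTen now holds the last 10 lines of the file, in order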

MikeT
  • No need to `await read`; `read` is guaranteed to be in the completed state (also, it would not compile because the anonymous method is not marked async). Just do a `read.Result` – Scott Chamberlain Jul 13 '16 at 16:16
  • Will this spin a tight loop on the `sr.Peek()` when there are 10 tasks processing? How do you remove completed tasks from the bag? – Jim Mischel Jul 13 '16 at 16:18
  • I always err on the side of caution with threading; awaiting something rarely hurts, but assuming something has completed when it hasn't can cause nightmares. Also note that I mentioned the need to add async in the text. I'm showing them how to do the task, not doing it for them – MikeT Jul 13 '16 at 16:19
  • @JimMischel there are several ways you can remove the task. The simplest would be LINQ with `Where(t => t.IsCompleted)` and then removing the results, or you could add a continuation on the processing (but not part of it) that removes the task from the bag. As for the peek, I'm not entirely sure; I've never actually tried this. It wouldn't hurt to add null checks to the processing phase and then use that to end the loop; you may waste a few cycles processing nulls – MikeT Jul 13 '16 at 16:23
  • Seems pretty inefficient to start a new `Task` for every line. Guess it depends on how much processing each line takes. – Jim Mischel Jul 13 '16 at 16:27
  • @JimMischel I agree completely, but it seems to be what the OP wants – MikeT Jul 13 '16 at 16:27
  • @JimMischel re Peek, just to clarify: I haven't checked, but I see no way ReadLineAsync can read a new line before the last ReadLineAsync has completed, so I suspect that on each loop ReadLineAsync waits for the previous call to complete before firing. If so, that means each peek will be at most one loop out of date – MikeT Jul 13 '16 at 16:41
  • My point is that if `bag.Count == 10`, then the loop tries to clear any completed tasks, and then goes back to `Peek` again. So if it takes any significant time to process a line, the main thread is in a very tight, CPU-chewing, do-nothing loop. – Jim Mischel Jul 13 '16 at 17:38
  • @JimMischel ah, now I see what you meant; that's why I had the yield, so that if the queue was full the thread surrendered its processing time to another thread. I swapped it for a WaitAny, as there's no point checking again unless one of the tasks finished – MikeT Jul 14 '16 at 09:10

This method would create "pages" of lines from your file.

public static IEnumerable<string[]> ReadFileAsLinesSets(string fileName, int setLen = 10)
{
    using (var reader = new StreamReader(fileName))
        while (!reader.EndOfStream)
        {
            var set = new List<string>();
            for (var i = 0; i < setLen && !reader.EndOfStream; i++)
            {
                set.Add(reader.ReadLine());
            }
            yield return set.ToArray();
        }
}
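
For example, the pages can be fed straight into PLINQ, much like the OP's original attempt (a hypothetical usage sketch; `ProcessPage` stands in for whatever per-page work you need to do):

    ReadFileAsLinesSets("YourFile.txt", 10)
        .AsParallel()
        .WithDegreeOfParallelism(10)
        .ForAll(page => ProcessPage(page)); // each page is a string[] of up to 10 lines

Because the outer sequence is lazy, only a handful of pages should be in memory at any given time.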

... More fun version...

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

class Example
{
    static void Main(string[] args)
    {
        "YourFile.txt".ReadAsLines()
                      .AsPaged(10)
                      .Select(a=>a.ToArray()) //required or else you will get random data since "WrappedEnumerator" is not thread safe
                      .AsParallel()
                      .WithDegreeOfParallelism(10)
                      .ForAll(a =>
        {
            //Do your work here.
            Console.WriteLine(a.Aggregate(new StringBuilder(), 
                                          (sb, v) => sb.AppendFormat("{0:000000} ", v), 
                                          sb => sb.ToString()));
        });
    }
}

public static class ToolsEx
{

    public static IEnumerable<IEnumerable<T>> AsPaged<T>(this IEnumerable<T> items,
                                                              int pageLength = 10)
    {
        using (var enumerator = new WrappedEnumerator<T>(items.GetEnumerator()))
            while (!enumerator.IsDone)
                yield return enumerator.GetNextPage(pageLength);
    }

    public static IEnumerable<T> GetNextPage<T>(this IEnumerator<T> enumerator,
                                                     int pageLength = 10)
    {
        for (var i = 0; i < pageLength && enumerator.MoveNext(); i++)
            yield return enumerator.Current;
    }

    public static IEnumerable<string> ReadAsLines(this string fileName)
    {
        using (var reader = new StreamReader(fileName))
            while (!reader.EndOfStream)
                yield return reader.ReadLine();
    }
}

internal class WrappedEnumerator<T> : IEnumerator<T>
{
    public WrappedEnumerator(IEnumerator<T> enumerator)
    {
        this.InnerEnumerator = enumerator;
        this.IsDone = false;
    }

    public IEnumerator<T> InnerEnumerator { get; private set; }
    public bool IsDone { get; private set; }

    public T Current { get { return this.InnerEnumerator.Current; } }
    object System.Collections.IEnumerator.Current { get { return this.Current; } }

    public void Dispose()
    {
        this.InnerEnumerator.Dispose();
        this.IsDone = true;
    }

    public bool MoveNext()
    {
        var next = this.InnerEnumerator.MoveNext();
        this.IsDone = !next;
        return next;
    }

    public void Reset()
    {
        this.IsDone = false;
        this.InnerEnumerator.Reset();
    }
}
Matthew Whited
  • Not entirely sure that would work, as .Net has a max 2Gb memory page size; reading the entire file, if it is multi-Gb, would hit that limit very quickly – MikeT Jul 13 '16 at 17:01
  • It would only read as much into memory as you call. If you are using something like `.AsParallel().WithDegreeOfParallelism(n)` it should only have `n` pages loaded at any given time. – Matthew Whited Jul 13 '16 at 17:39
  • And yes, it would be possible to make it even more lazy, so that even the inner set is an IEnumerable... but that would be a little more complex than I want to write up for an SO answer... at least for now. – Matthew Whited Jul 13 '16 at 17:49
  • In the first version, it should be 'for (var i = 0; i < setLen && !reader.EndOfStream; i++)'. I did test it – Avlin Jul 26 '18 at 13:15