SemaphoreSlim sm = new SemaphoreSlim(10);

using (FileStream fileStream = File.OpenRead("..."))
using (StreamReader streamReader = new StreamReader(fileStream, Encoding.UTF8, true, 4096))
{
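    string line;
    while ((line = streamReader.ReadLine()) != null)
    {
        // Copy to a per-iteration local: the lambda would otherwise capture the
        // shared `line` variable and could see a later line (or null).
        string current = line;
        sm.Wait();
        new Thread(() =>
        {
            doSomething(current);
            sm.Release();
        }).Start();
    }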
}
MessageBox.Show("This should only show once doSomething() has done its LAST line.");

So, I have an extremely large file, and I want to run code on every single line.

I want to do it in parallel, but with a maximum of 10 lines at a time.

My solution was to use a SemaphoreSlim: wait before starting each thread, and release when the thread finishes. (Since the function is synchronous, the placement of .Release() works.)

The issue is that the code uses a LOT of CPU. Memory behaves as expected: instead of loading over 400 MB, it just goes up and down by a few MB every few seconds.

But CPU goes crazy: most of the time it's locked at 100% for a good 30 seconds, then dips slightly and goes back up.

Since I don't want to load every line into memory, and want to run code as it goes, what's the best solution here?

[Image: 500 lines in on a 9,700-line file]

[Image: 600 lines in on a 2.7-million-line file]

EDIT

I changed from new Thread(()=>{}).Start(); to Task.Factory.StartNew(()=>{}); as suggested in the comments, where it was pointed out that creating and destroying threads was causing the performance drop. That seems to be right. After I moved to Task.Factory.StartNew, it runs at the rate the semaphore allows, and its CPU usage is exactly like my Parallel.ForEach version.
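The approach described in the edit could look roughly like the sketch below. It is not the asker's exact code: `BoundedLineProcessor`, `ProcessLines`, `maxParallel` and `work` are hypothetical names, `Task.Run` stands in for `Task.Factory.StartNew`, and the final loop is one assumed way of making the caller wait until the last line has been handled (which the original `MessageBox.Show` call needs).

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedLineProcessor
{
    // Runs `work` on every line of `path`, with at most `maxParallel` calls in flight.
    // Returns only after the last call has finished.
    public static void ProcessLines(string path, int maxParallel, Action<string> work)
    {
        var gate = new SemaphoreSlim(maxParallel);

        // File.ReadLines streams the file, so only a few lines are in memory at a time.
        foreach (string line in File.ReadLines(path))
        {
            gate.Wait(); // blocks while maxParallel calls are already running

            // Thread-pool task instead of a dedicated thread per line.
            Task.Run(() =>
            {
                try { work(line); }
                finally { gate.Release(); } // free the slot even if work throws
            });
        }

        // Take every slot back: this succeeds only once all workers have released.
        for (int i = 0; i < maxParallel; i++)
        {
            gate.Wait();
        }
    }
}
```

A call such as `BoundedLineProcessor.ProcessLines(path, 10, doSomething);` followed by the message box would then show the message only after every line is done. Note that exceptions thrown inside `work` are not observed in this sketch.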

1 Answer

Your code creates a huge number of threads, which is inefficient. C# has easier ways of handling your scenario. One approach is:

File.ReadLines(path, Encoding.UTF8)
    .AsParallel().WithDegreeOfParallelism(10)
    .ForAll(doSomething);
Kobi
  • Doesn't that load every line in at once? The reason I'm doing it line by line is that my files are way too big to store in memory. –  Mar 11 '18 at 06:50
  • @user8549339 - I've added descriptions and documentation. No, `File.ReadLines` is designed for your use case and reads lines lazily as needed. `ReadAllLines` would read all lines eagerly. – Kobi Mar 11 '18 at 06:54
  • `s/while/whole` – Patrick Roberts Mar 11 '18 at 06:54
  • Wow, perfect, it's working exactly as needed. I did find a solution to the threads issue and added it in an edit, but this will be marked as the answer, as it's a more convenient way of doing what I was doing (as long as you don't mind LINQ). –  Mar 11 '18 at 06:56
  • One question: is there a way to get the line count without storing every line in memory? –  Mar 11 '18 at 06:56
  • @user8549339 - If you mean getting the line number to `doSomething`, you can use `.ReadLines(path, Encoding.UTF8).Select((line, lineNumber)=>new {LineContent=line, LineNumber=lineNumber})`, and then `.ForAll(l => doSomething(l.LineContent, l.LineNumber))`. – Kobi Mar 11 '18 at 07:18
  • @user8549339 - If you just want the line count, `File.ReadLines(path, Encoding.UTF8).Count()` will read each line, but not all lines at the same time, so it will still work for a very big file, but it will create a lot of work for the garbage collector. – Kobi Mar 11 '18 at 07:22
  • @Kobi Thanks! I went a slightly different route though. But all good! –  Mar 11 '18 at 08:09
  • New issue, days after this was answered. Basically, in the ForAll() I'm doing a conditional check on the lines before running the actual code. For example: `if(line.Contains("!!")){ doSomething(); }`, with no else or anything. The issue is that it's still going to take up one of those parallelism threads, causing a slowdown if a lot of the lines don't contain !!. Is there any kind of workaround, or should I use the max threads on WithDegreeOfParallelism and then use SemaphoreSlim.Wait()/Release() inside the if? –  Mar 13 '18 at 11:50
  • @user8549339 - You can refactor the method and separate the code that checks the condition from the code that executes the command. In the example, you could do `.ReadLines().Where(line => line.Contains("!!")).AsParallel()`, so you only get a task for each line that matches your condition. It's a little difficult to understand your exact scenario, so it would be better if you posted a new question with an updated example, possibly also with the file content. – Kobi Mar 13 '18 at 12:01
  • Perfect! Exactly what was needed. I would have thought this would cause it to actually load every line in memory, guess not. Thanks! –  Mar 14 '18 at 13:24
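Putting the suggestions from the comments together, a minimal sketch might look like the following. `FilteredParallelLines`, `Run` and `handle` are hypothetical names used only for illustration, and the `"!!"` test is the example condition from the comments. The `Select` and `Where` calls sit before `AsParallel()`, so numbering and filtering happen sequentially while the file is streamed, and only matching lines are handed to the parallel workers.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class FilteredParallelLines
{
    // Calls `handle(content, lineNumber)` in parallel for every line containing "!!".
    public static void Run(string path, Action<string, int> handle)
    {
        File.ReadLines(path, Encoding.UTF8)                  // lazy: lines are read as needed
            // Select's index is zero-based, so add 1 for a human-style line number.
            .Select((line, index) => new { Content = line, Number = index + 1 })
            .Where(l => l.Content.Contains("!!"))            // cheap check, done before going parallel
            .AsParallel()
            .WithDegreeOfParallelism(10)
            .ForAll(l => handle(l.Content, l.Number));
    }
}
```

Because the line number is attached before the filter, each matching line keeps its position in the original file. `ForAll` gives no ordering guarantee, so `handle` should not rely on lines arriving in file order.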