21

I have very big files that I have to read and process. Can this be done in parallel using Threading?

Here is a bit of code that I've written, but it doesn't seem to achieve a shorter execution time than reading and processing the files one after the other.

String[] files = openFileDialog1.FileNames;

Parallel.ForEach(files, f =>
{
    readTraceFile(f);
});        

private void readTraceFile(String file)
{
    StreamReader reader = new StreamReader(file);
    String line;

    while ((line = reader.ReadLine()) != null)
    {
        String pattern = "\\s{4,}";

        foreach (String trace in Regex.Split(line, pattern))
        {
            if (trace != String.Empty)
            {
                String[] details = Regex.Split(trace, "\\s+");

                Instruction instruction = new Instruction(details[0],
                    int.Parse(details[1]),
                    int.Parse(details[2]));
                Console.WriteLine("computing...");
                instructions.Add(instruction);
            }
        }
    }
}
luca.p.alexandru
  • Are you CPU-bound or IO-bound? – SLaks Jan 05 '14 at 00:55
  • Is `instructions` thread-safe? (answer: No) – SLaks Jan 05 '14 at 00:55
  • IO systems are not as fast as your CPU, so it is not surprising that you get no benefit from using multiple threads when IO is involved. – L.B Jan 05 '14 at 00:55
  • How can I make it thread-safe? I am new to multithreading, and I have to do this application for a school project (see the sketch after these comments) – luca.p.alexandru Jan 05 '14 at 00:56
  • I know they are not as fast as the CPU, so aren't there any methods with which I could process files faster? – luca.p.alexandru Jan 05 '14 at 00:57
  • @user2936347: Buy a faster hard drive. – SLaks Jan 05 '14 at 00:57
  • So parallel file processing can't be done in parallel? – luca.p.alexandru Jan 05 '14 at 00:58
  • @spender: Because he's calling it in the parallel code. – SLaks Jan 05 '14 at 00:58
  • @user2936347 It can.. but only if you're performing computations on it whilst it's in memory (CPU-bound). If you are waiting for a large file to be loaded into memory (I/O-bound).. then no. – Simon Whitehead Jan 05 '14 at 00:59
  • @user2936347: It can, but you need to make sure it's properly thread-safe (which is not easy). In particular, you cannot update shared mutable state. – SLaks Jan 05 '14 at 00:59
  • @user2936347 `So parallel file processing can't be done in parallel?` Suppose you have N Mb internet access. No matter how many threads you use, you cannot exceed that speed. So you can use parallelism, but that doesn't mean it will be faster. – L.B Jan 05 '14 at 01:03
  • Would it make any difference if I read the files one after the other, split each file's content, and processed the split content on different threads? – luca.p.alexandru Jan 05 '14 at 01:27
  • @user2936347, that's the way to go. You have a dedicated thread (or an async `Task` if you want) for IO, and another thread or `Task` processing the content as it becomes available (possibly in parallel if your processing becomes a bottleneck). Classic producer-consumer. See Patterns of Parallel Programming (http://www.microsoft.com/en-au/download/details.aspx?id=19222). Page 55 is almost *exactly* your scenario. – Kirill Shlenskiy Jan 05 '14 at 01:50
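As a side note on the thread-safety point raised above: the question never shows how `instructions` is declared, but assuming it is a plain `List<Instruction>` (an assumption, since the field is not in the posted code), a minimal sketch of two common fixes looks like this:

// Option 1: swap the list for a thread-safe collection
// (requires using System.Collections.Concurrent).
// Assumption: 'instructions' was previously declared as List<Instruction>.
private readonly ConcurrentBag<Instruction> instructions = new ConcurrentBag<Instruction>();

// Option 2: keep the List<Instruction> and serialise access with a lock.
private readonly object instructionsLock = new object();

// ...then, inside readTraceFile, wrap the add:
lock (instructionsLock)
{
    instructions.Add(instruction);
}

With either option the items will not end up in file order, because Parallel.ForEach processes the files concurrently.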

4 Answers

30

It looks like your application's performance is mostly limited by IO. However, you still have a bit of CPU-bound work in your code. These two bits of work are interdependent: your CPU-bound work cannot start until the IO has done its job, and the IO does not move on to the next work item until your CPU has finished with the previous one. They're both holding each other up. Therefore, it is possible (explained at the very bottom) that you will see an improvement in throughput if you perform your IO- and CPU-bound work in parallel, like so:

void ReadAndProcessFiles(string[] filePaths)
{
    // Our thread-safe collection used for the handover.
    var lines = new BlockingCollection<string>();

    // Build the pipeline.
    var stage1 = Task.Run(() =>
    {
        try
        {
            foreach (var filePath in filePaths)
            {
                using (var reader = new StreamReader(filePath))
                {
                    string line;

                    while ((line = reader.ReadLine()) != null)
                    {
                        // Hand over to stage 2 and continue reading.
                        lines.Add(line);
                    }
                }
            }
        }
        finally
        {
            lines.CompleteAdding();
        }
    });

    var stage2 = Task.Run(() =>
    {
        // Process lines on a ThreadPool thread
        // as soon as they become available.
        foreach (var line in lines.GetConsumingEnumerable())
        {
            String pattern = "\\s{4,}";

            foreach (String trace in Regex.Split(line, pattern))
            {
                if (trace != String.Empty)
                {
                    String[] details = Regex.Split(trace, "\\s+");

                    Instruction instruction = new Instruction(details[0],
                        int.Parse(details[1]),
                        int.Parse(details[2]));
                    Console.WriteLine("computing...");
                    instructions.Add(instruction);
                }
            }
        }
    });

    // Block until both tasks have completed.
    // This makes this method prone to deadlocking.
    // Consider using 'await Task.WhenAll' instead.
    Task.WaitAll(stage1, stage2);
}

I highly doubt that it's your CPU work holding things up, but if it happens to be the case, you can also parallelise stage 2 like so:

    var stage2 = Task.Run(() =>
    {
        var parallelOptions = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };

        Parallel.ForEach(lines.GetConsumingEnumerable(), parallelOptions, line =>
        {
            String pattern = "\\s{4,}";

            foreach (String trace in Regex.Split(line, pattern))
            {
                if (trace != String.Empty)
                {
                    String[] details = Regex.Split(trace, "\\s+");

                    Instruction instruction = new Instruction(details[0],
                        int.Parse(details[1]),
                        int.Parse(details[2]));
                    Console.WriteLine("computing...");
                    instructions.Add(instruction);
                }
            }
        });
    });

Mind you, if your CPU work component is negligible in comparison to the IO component, you won't see much speed-up. The more even the workload is, the better the pipeline is going to perform in comparison with sequential processing.

Since we're talking about performance, note that I am not particularly thrilled about the number of blocking calls in the above code. If I were doing this in my own project, I would have gone the async/await route. I chose not to do so in this case because I wanted to keep things easy to understand and easy to integrate.
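For illustration only (this is not part of the original answer), a rough sketch of that async/await route might look like the following: the file reads use StreamReader.ReadLineAsync, and the blocking Task.WaitAll is replaced with await Task.WhenAll so the calling thread is never tied up. Stage 2 is elided because it is identical to the version above.

async Task ReadAndProcessFilesAsync(string[] filePaths)
{
    var lines = new BlockingCollection<string>();

    var stage1 = Task.Run(async () =>
    {
        try
        {
            foreach (var filePath in filePaths)
            {
                using (var reader = new StreamReader(filePath))
                {
                    string line;

                    // ReadLineAsync frees the thread while the IO is in flight.
                    while ((line = await reader.ReadLineAsync()) != null)
                    {
                        lines.Add(line);
                    }
                }
            }
        }
        finally
        {
            lines.CompleteAdding();
        }
    });

    var stage2 = Task.Run(() =>
    {
        foreach (var line in lines.GetConsumingEnumerable())
        {
            // ... same per-line processing as in stage 2 above ...
        }
    });

    // Await instead of blocking, which avoids the deadlock risk noted above.
    await Task.WhenAll(stage1, stage2);
}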

Kirill Shlenskiy
  • Good example code. About the "number of blocking calls": isn't it enough to change the last line to await Task.WhenAll? And of course change the method to async Task. – Jonas Oct 18 '18 at 11:22
7

From the look of what you are trying to do, you are almost certainly I/O bound. Attempting parallel processing in this case will not help and may in fact slow things down due to the additional seek operations on the disk drives (unless you can have the data split over multiple spindles).

Gary Walker
  • And if I am I/O bound, can anything be done in order to increase performance? – luca.p.alexandru Jan 05 '14 at 01:07
  • @user2936347 usually doing many asynchronous calls is better for I/O. Take a look at the new `async-await` pattern (see the sketch after these comments) – i3arnon Jan 05 '14 at 01:45
  • @user2936347: There are a few strategies to help with I/O issues. However most require an investment in hardware. Whether that means a single faster drive (like SSD), RAID 0 or 1, or even just splitting the files across multiple drives each with their own independent controllers or some combination thereof. – NotMe Jan 05 '14 at 01:55
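To make the async-await suggestion above concrete, here is a minimal, hypothetical sketch (not from the answer) that reads several files without blocking threads while the disk works, using StreamReader.ReadToEndAsync, which is available from .NET 4.5:

// Hypothetical helper: read each file's full text without blocking a thread on IO.
// Requires using System.IO, System.Linq and System.Threading.Tasks.
async Task<string[]> ReadAllFilesAsync(string[] filePaths)
{
    var readTasks = filePaths.Select(async path =>
    {
        using (var reader = new StreamReader(path))
        {
            // The thread returns to the pool while the read is pending.
            return await reader.ReadToEndAsync();
        }
    });

    // Completes once every file has been read.
    return await Task.WhenAll(readTasks);
}

Keep in mind that this does not make the drive itself any faster; on a single spinning disk the concurrent reads may still be serialised by the hardware, which is exactly the point this answer makes.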
0

Try processing the lines in parallel instead. For example:

var q = from file in files
        from line in File.ReadLines(file).AsParallel()    // for smaller files File.ReadAllLines(file).AsParallel() might be faster
        from trace in line.Split(new [] {"    "}, StringSplitOptions.RemoveEmptyEntries)  // split by 4 spaces and no need for trace != "" check
        let details = trace.Split(null as char[], StringSplitOptions.RemoveEmptyEntries)  // like Regex.Split(trace, "\\s+") but removes empty strings too
        select new Instruction(details[0], int.Parse(details[1]), int.Parse(details[2]));

List<Instruction> instructions = q.ToList();  // all of the file reads and work is done here with .ToList

Random access to a non-SSD hard drive (when you try to read/write different files at the same time, or a fragmented file) is usually much slower than sequential access (for example, reading a single defragmented file), so I expect processing a single file in parallel to be faster with defragmented files.

Also, sharing resources across threads (for example Console.Write, or adding to a thread-safe blocking collection) can slow down or even block/deadlock the execution, because some of the threads will have to wait for others to finish accessing that resource.

Slai
-1
// Thread-safe collection for results gathered from parallel iterations.
var entries = new ConcurrentBag<object>();
var files = Directory.GetFiles(path, "*.txt", SearchOption.AllDirectories);
int fileCounter = 0;

Parallel.ForEach(files, file =>
{
    // Read each file, record its line count, and bump the shared counter atomically.
    var lines = File.ReadAllLines(file, Encoding.Default);
    entries.Add(new { lineCount = lines.Length });
    Interlocked.Increment(ref fileCounter);
});
Iman