21

I have very big files that I have to read and process. Can this be done in parallel using Threading?

Here is a bit of code that I've written, but it doesn't seem to achieve a shorter execution time than reading and processing the files one after the other.

String[] files = openFileDialog1.FileNames;

Parallel.ForEach(files, f =>
{
    readTraceFile(f);
});        

private void readTraceFile(String file)
{
    StreamReader reader = new StreamReader(file);
    String line;

    while ((line = reader.ReadLine()) != null)
    {
        String pattern = "\\s{4,}";

        foreach (String trace in Regex.Split(line, pattern))
        {
            if (trace != String.Empty)
            {
                String[] details = Regex.Split(trace, "\\s+");

                Instruction instruction = new Instruction(details[0],
                    int.Parse(details[1]),
                    int.Parse(details[2]));
                Console.WriteLine("computing...");
                instructions.Add(instruction);
            }
        }
    }
}
luca.p.alexandru
  • Are you CPU-bound or IO-bound? – SLaks Jan 05 '14 at 00:55
  • Is `instructions` thread-safe? (answer: No) – SLaks Jan 05 '14 at 00:55
  • IO systems are not as fast as your CPU, so it is not surprising that you get no benefit from using multiple threads when IO is involved. – L.B Jan 05 '14 at 00:55
  • How can I make it thread-safe? I am new to multithreading, and I have to do this application for a school project (see the sketch after these comments) – luca.p.alexandru Jan 05 '14 at 00:56
  • I know they are not as fast as the CPU, so aren't there any methods with which I could process files faster? – luca.p.alexandru Jan 05 '14 at 00:57
  • @user2936347: Buy a faster hard drive. – SLaks Jan 05 '14 at 00:57
  • So parallel file processing can't be done in parallel? – luca.p.alexandru Jan 05 '14 at 00:58
  • @spender: Because he's calling it in the parallel code. – SLaks Jan 05 '14 at 00:58
  • @user2936347 It can.. but only if you're performing computations on it whilst it's in memory (CPU-bound). If you are waiting for a large file to be loaded into memory (I/O-bound).. then no. – Simon Whitehead Jan 05 '14 at 00:59
  • @user2936347: It can, but you need to make sure it's properly thread-safe (which is not easy). In particular, you cannot update shared mutable state. – SLaks Jan 05 '14 at 00:59
  • @user2936347 `So parallel file processing can't be done in parallel?` Suppose you have N Mb internet access. No matter how many threads you use, you cannot exceed that speed. So you can use parallelism, but that doesn't mean it will be faster. – L.B Jan 05 '14 at 01:03
  • Would it make any difference if I read the files one after the other, split each file's content, and processed the split content on different threads? – luca.p.alexandru Jan 05 '14 at 01:27
  • @user2936347, that's the way to go. You have a dedicated thread (or an async `Task` if you want) for IO, and another thread or `Task` processing the content as it becomes available (possibly in parallel if your processing becomes a bottleneck). Classic producer-consumer. See Patterns of Parallel Programming (http://www.microsoft.com/en-au/download/details.aspx?id=19222). Page 55 is almost *exactly* your scenario. – Kirill Shlenskiy Jan 05 '14 at 01:50
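As a side note on the thread-safety point raised above: the question never shows how `instructions` is declared, but assuming it is a plain `List<Instruction>` (an assumption, since the field is not in the posted code), a minimal sketch of two common fixes looks like this:

// Option 1: swap the list for a thread-safe collection
// (requires using System.Collections.Concurrent).
// Assumption: 'instructions' was previously declared as List<Instruction>.
private readonly ConcurrentBag<Instruction> instructions = new ConcurrentBag<Instruction>();

// Option 2: keep the List<Instruction> and serialise access with a lock.
private readonly object instructionsLock = new object();

// ...then, inside readTraceFile, wrap the add:
lock (instructionsLock)
{
    instructions.Add(instruction);
}

With either option the items will not end up in file order, because Parallel.ForEach processes the files concurrently.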

4 Answers

30

It looks like your application's performance is mostly limited by IO. However, you still have a bit of CPU-bound work in your code. These two bits of work are interdependent: your CPU-bound work cannot start until the IO has done its job, and the IO does not move on to the next work item until your CPU has finished with the previous one. They're both holding each other up. Therefore, it is possible (explained at the very bottom) that you will see an improvement in throughput if you perform your IO- and CPU-bound work in parallel, like so:

void ReadAndProcessFiles(string[] filePaths)
{
    // Our thread-safe collection used for the handover.
    var lines = new BlockingCollection<string>();

    // Build the pipeline.
    var stage1 = Task.Run(() =>
    {
        try
        {
            foreach (var filePath in filePaths)
            {
                using (var reader = new StreamReader(filePath))
                {
                    string line;

                    while ((line = reader.ReadLine()) != null)
                    {
                        // Hand over to stage 2 and continue reading.
                        lines.Add(line);
                    }
                }
            }
        }
        finally
        {
            lines.CompleteAdding();
        }
    });

    var stage2 = Task.Run(() =>
    {
        // Process lines on a ThreadPool thread
        // as soon as they become available.
        foreach (var line in lines.GetConsumingEnumerable())
        {
            String pattern = "\\s{4,}";

            foreach (String trace in Regex.Split(line, pattern))
            {
                if (trace != String.Empty)
                {
                    String[] details = Regex.Split(trace, "\\s+");

                    Instruction instruction = new Instruction(details[0],
                        int.Parse(details[1]),
                        int.Parse(details[2]));
                    Console.WriteLine("computing...");
                    instructions.Add(instruction);
                }
            }
        }
    });

    // Block until both tasks have completed.
    // This makes this method prone to deadlocking.
    // Consider using 'await Task.WhenAll' instead.
    Task.WaitAll(stage1, stage2);
}

I highly doubt that it's your CPU work holding things up, but if it happens to be the case, you can also parallelise stage 2 like so:

    var stage2 = Task.Run(() =>
    {
        var parallelOptions = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };

        Parallel.ForEach(lines.GetConsumingEnumerable(), parallelOptions, line =>
        {
            String pattern = "\\s{4,}";

            foreach (String trace in Regex.Split(line, pattern))
            {
                if (trace != String.Empty)
                {
                    String[] details = Regex.Split(trace, "\\s+");

                    Instruction instruction = new Instruction(details[0],
                        int.Parse(details[1]),
                        int.Parse(details[2]));
                    Console.WriteLine("computing...");
                    instructions.Add(instruction);
                }
            }
        });
    });

Mind you, if your CPU work component is negligible in comparison to the IO component, you won't see much speed-up. The more even the workload is, the better the pipeline is going to perform in comparison with sequential processing.

Since we're talking about performance, note that I am not particularly thrilled about the number of blocking calls in the above code. If I were doing this in my own project, I would have gone the async/await route. I chose not to do so in this case because I wanted to keep things easy to understand and easy to integrate.
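For illustration only (this is not part of the original answer), a rough sketch of that async/await route might look like the following: the file reads use StreamReader.ReadLineAsync, and the blocking Task.WaitAll is replaced with await Task.WhenAll so the calling thread is never tied up. Stage 2 is elided because it is identical to the version above.

async Task ReadAndProcessFilesAsync(string[] filePaths)
{
    var lines = new BlockingCollection<string>();

    var stage1 = Task.Run(async () =>
    {
        try
        {
            foreach (var filePath in filePaths)
            {
                using (var reader = new StreamReader(filePath))
                {
                    string line;

                    // ReadLineAsync frees the thread while the IO is in flight.
                    while ((line = await reader.ReadLineAsync()) != null)
                    {
                        lines.Add(line);
                    }
                }
            }
        }
        finally
        {
            lines.CompleteAdding();
        }
    });

    var stage2 = Task.Run(() =>
    {
        foreach (var line in lines.GetConsumingEnumerable())
        {
            // ... same per-line processing as in stage 2 above ...
        }
    });

    // Await instead of blocking, which avoids the deadlock risk noted above.
    await Task.WhenAll(stage1, stage2);
}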

Kirill Shlenskiy
  • Good example code. About the "number of blocking calls": isn't it enough to change the last line to await Task.WhenAll? And of course change the method to async Task. – Jonas Oct 18 '18 at 11:22
7

From the look of what you are trying to do, you are almost certainly I/O bound. Attempting parallel processing in this case will not help and may in fact slow things down due to the additional seek operations on the disk drives (unless you can have the data split over multiple spindles).

Gary Walker
  • And if I am I/O bound, can anything be done in order to increase performance? – luca.p.alexandru Jan 05 '14 at 01:07
  • @user2936347 usually doing many asynchronous calls is better for I/O. Take a look at the new `async-await` pattern (see the sketch after these comments) – i3arnon Jan 05 '14 at 01:45
  • @user2936347: There are a few strategies to help with I/O issues. However most require an investment in hardware. Whether that means a single faster drive (like SSD), RAID 0 or 1, or even just splitting the files across multiple drives each with their own independent controllers or some combination thereof. – NotMe Jan 05 '14 at 01:55
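To make the async-await suggestion above concrete, here is a minimal, hypothetical sketch (not from the answer) that reads several files without blocking threads while the disk works, using StreamReader.ReadToEndAsync, which is available from .NET 4.5:

// Hypothetical helper: read each file's full text without blocking a thread on IO.
// Requires using System.IO, System.Linq and System.Threading.Tasks.
async Task<string[]> ReadAllFilesAsync(string[] filePaths)
{
    var readTasks = filePaths.Select(async path =>
    {
        using (var reader = new StreamReader(path))
        {
            // The thread returns to the pool while the read is pending.
            return await reader.ReadToEndAsync();
        }
    });

    // Completes once every file has been read.
    return await Task.WhenAll(readTasks);
}

Keep in mind that this does not make the drive itself any faster; on a single spinning disk the concurrent reads may still be serialised by the hardware, which is exactly the point this answer makes.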
0

Try processing the lines in parallel instead. For example:

var q = from file in files
        from line in File.ReadLines(file).AsParallel()    // for smaller files File.ReadAllLines(file).AsParallel() might be faster
        from trace in line.Split(new [] {"    "}, StringSplitOptions.RemoveEmptyEntries)  // split by 4 spaces and no need for trace != "" check
        let details = trace.Split(null as char[], StringSplitOptions.RemoveEmptyEntries)  // like Regex.Split(trace, "\\s+") but removes empty strings too
        select new Instruction(details[0], int.Parse(details[1]), int.Parse(details[2]));

List<Instruction> instructions = q.ToList();  // all of the file reads and work is done here with .ToList

Random access to a non-SSD hard drive (when you try to read/write different files at the same time, or a fragmented file) is usually much slower than sequential access (for example, reading a single defragmented file), so I expect processing a single file in parallel to be faster with defragmented files.

Also, sharing resources across threads (for example Console.Write, or adding to a thread-safe blocking collection) can slow down or even block/deadlock the execution, because some of the threads will have to wait for others to finish accessing that resource.

Slai
-1
// Thread-safe collection for results gathered from parallel iterations.
var entries = new ConcurrentBag<object>();
var files = Directory.GetFiles(path, "*.txt", SearchOption.AllDirectories);
int fileCounter = 0;

Parallel.ForEach(files, file =>
{
    // Read each file, record its line count, and bump the shared counter atomically.
    var lines = File.ReadAllLines(file, Encoding.Default);
    entries.Add(new { lineCount = lines.Length });
    Interlocked.Increment(ref fileCounter);
});
Iman