
I have ~100 text files, 200 MB each, that I need to parse. The program below loads the files and processes them in parallel; it can create either a thread per file or a process per file.

The problem: if I use threads, the program never reaches 100% CPU and takes longer to complete.

THREAD PER FILE
total time: 430 sec
CPU usage: 15-20%
CPU frequency: 1.2 GHz

PROCESS PER FILE
total time: 100 sec
CPU usage: 100%
CPU frequency: 3.75 GHz

I'm using an E5-1650 v3 hexa-core with HT, so I process 12 files at a time.

How can I achieve 100% CPU utilisation with threads?

The code below does not use the result of the processing, since that does not affect the problem.

using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Text;
using System.Threading;

namespace libsvm2tsv
{
    class Program
    {
        static void Main(string[] args)
        {
            var sw = Stopwatch.StartNew();

            switch (args[0])
            {
                case "-t": LoadAll(args[1], LoadFile); break;
                case "-p": LoadAll(args[1], RunChild); break;
                case "-f": LoadFile(args[1]); return;
            }

            Console.WriteLine("ELAPSED: {0} sec.", sw.ElapsedMilliseconds / 1000);
            Console.ReadLine();
        }

        static void LoadAll(string folder, Action<string> algorithm)
        {
            // Run at most 12 files at a time (6 cores * 2 with HT).
            var sem = new SemaphoreSlim(12);
            Directory.EnumerateFiles(folder).ToList().ForEach(f =>
            {
                sem.Wait();
                new Thread(() => { try { algorithm(f); } finally { sem.Release(); } }).Start();
            });
        }

        // Re-launch this executable with -f so the file is parsed in a child process.
        static void RunChild(string file)
        {
            Process.Start(new ProcessStartInfo
            {
                FileName = Assembly.GetEntryAssembly().Location,
                Arguments = "-f \"" + file + "\"",
                UseShellExecute = false,
                CreateNoWindow = true
            })
            .WaitForExit();
        }

        // Read the file line by line, parsing each line.
        static void LoadFile(string inFile)
        {
            using (var ins = File.OpenText(inFile))
                while (ins.Peek() >= 0)
                    ParseLine(ins.ReadLine());
        }

        // libsvm line format: "label index:value index:value ...".
        // Skip the label, scale each value by 1000, clamp negatives to -1.
        static long[] ParseLine(string line)
        {
            return line
                .Split()
                .Skip(1)
                .Select(r => (long)(double.Parse(r.Split(':')[1]) * 1000))
                .Select(r => r < 0 ? -1 : r)
                .ToArray();
        }
    }
}
Anton Burtsev
  • Simplify your code, explain the actual problem and *don't* try to handle threads by yourself. TPL was built to make such things easier. Don't split strings either, a Regex is orders of magnitude faster *and* doesn't generate temporary strings. It's so much faster that you may not need multithreaded processing. – Panagiotis Kanavos Jun 02 '17 at 08:31
  • The throughput is probably limited by your disk rather than by your CPU. Therefore, the CPUs are waiting and are not fully loaded. – Axel Kemper Jun 02 '17 at 08:32
  • Finally, your code is IO bound, not CPU bound. Use asynchronous methods to avoid blocking threads while waiting for IO to complete – Panagiotis Kanavos Jun 02 '17 at 08:32
  • It's optimistic to assume you can do disk IO fast enough to run whatever you're doing at full speed. Sometimes you can, if you do enough work. But there is absolutely no guarantee that it can be done in general. – harold Jun 02 '17 at 08:33
  • @harold both .NET and Windows cache files. Processing two files in parallel will run faster in a multicore CPU – Panagiotis Kanavos Jun 02 '17 at 08:35
  • @AntonBurtsev please explain the *actual* problem. What do you want to parse? – Panagiotis Kanavos Jun 02 '17 at 08:36
  • @PanagiotisKanavos faster is not the question. OP wants to reach 100% CPU utilization. – harold Jun 02 '17 at 08:39
  • @harold that is the same question. What's the point of reaching 100% CPU utilization with spinlocks? Or performing garbage collection? Caching means that multiple cores *will* have data to process in the general case. Maybe not all of them, maybe half will load data and the other half will process it – Panagiotis Kanavos Jun 02 '17 at 08:41
  • First thoughts were the same as @AxelKemper. I/O performance is often the bottleneck for such operations, so ensure that your disk(s) are fast enough; SSDs are best. And don't limit the number of threads to the number of cores (*2 for hyperthreading CPUs), as while processing there will be times when the CPU is waiting for internal transfers of data/instructions, so more threads means more CPU utilisation. Oh, and of course 100% utilisation is normally not the most efficient! Some people consider > 80% unacceptable, as clashes for CPU resources mean longer processing times. – jason.kaisersmith Jun 02 '17 at 09:06
  • @PanagiotisKanavos The problem is the time it takes to parse the files. But using multiple processes is not the solution, since it is harder to maintain. – Anton Burtsev Jun 02 '17 at 09:22
  • @AxelKemper If I add "return null" in the ParseLine method, both approaches take 35 sec. Copying the files in Explorer also takes ~35 sec, so the disks are fast enough. – Anton Burtsev Jun 02 '17 at 09:24
  • You are doing a lot of IO work so you need more than 12 (number of core) threads to fully utilize your CPU (assuming your hard drive is fast enough). Use `Parallel.ForEach` with `MaxDegreeOfParallelism` set to something like 24 and see how it goes (or use async\await with SemaphoreSlim, but that is a bit harder). – Evk Jun 02 '17 at 09:26
  • @Evk Parallel did not change anything. This is the real data: http://olap.izobility.net/data.7z - maybe someone could try it and share their experience. – Anton Burtsev Jun 02 '17 at 09:33
  • If you followed the example from the answer - it does not set MaxDegreeOfParallelism, so it processes with `number of cores` threads, but you need more. – Evk Jun 02 '17 at 09:34
  • I used Parallel.ForEach( Directory.EnumerateFiles(folder), new ParallelOptions { MaxDegreeOfParallelism = 12 }, f => algorithm(f)); – Anton Burtsev Jun 02 '17 at 09:35
  • Have you tried Log Parser for file parsing (https://technet.microsoft.com/en-us/scriptcenter/dd919274.aspx)? – Nisus Jun 02 '17 at 09:38
  • Well I suggested to try 24, not 12. I don't have fast enough disk to ever reach 100% cpu with this, however many threads I will try. Try this: `ThreadPool.SetMinThreads(64, 64); Parallel.ForEach(Directory.EnumerateFiles(folder), new ParallelOptions { MaxDegreeOfParallelism = 24 }, LoadFile);` and if not enough - increase 24 even more. – Evk Jun 02 '17 at 09:45
  • @PanagiotisKanavos no it's not the same question, and why would you even assume I meant it that way? Increasing the performance is one thing; reaching 100% CPU utilization is quite another - it can only be done if disk IO isn't the bottleneck. – harold Jun 02 '17 at 11:02

4 Answers


Finally, I've found the bottleneck. I was using string.Split to parse numbers from every line of data, which produced billions of short strings allocated on the heap. Since all threads share a single heap, memory allocation is synchronized; processes have separate heaps, so no synchronization occurs and things run fast. That's the root of the issue. So I rewrote the parsing using IndexOf rather than Split, and threads started to perform even better than separate processes, just as I expected.

Since .NET has no built-in tool to parse a real number at a given position inside a string, I used the parser from https://codereview.stackexchange.com/questions/75791/optimize-custom-double-parse with a small modification.

using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Threading;
using System.Threading.Tasks;

namespace libsvm2tsv
{
    class Program
    {

        static void Main(string[] args)
        {
            var sw = Stopwatch.StartNew();

            switch (args[0])
            {
                case "-t": LoadAll(args[1], LoadFile); break;
                case "-p": LoadAll(args[1], RunChild); break;
                case "-f": LoadFile(args[1]); return;
            }

            Console.WriteLine("ELAPSED: {0} sec.", sw.ElapsedMilliseconds / 1000);
            Console.ReadLine();
        }

        static void LoadAll(string folder, Action<string> algorithm)
        {
            Parallel.ForEach(
                Directory.EnumerateFiles(folder),
                new ParallelOptions { MaxDegreeOfParallelism = 12 },
                f => algorithm(f));
        }

        static void RunChild(string file)
        {
            Process.Start(new ProcessStartInfo
            {
                FileName = Assembly.GetEntryAssembly().Location,
                Arguments = "-f \"" + file + "\"",
                UseShellExecute = false,
                CreateNoWindow = true
            })
            .WaitForExit();
        }

        static void LoadFile(string inFile)
        {
            using (var ins = File.OpenText(inFile))
                while (ins.Peek() >= 0)
                    ParseLine(ins.ReadLine());
        }

        static long[] ParseLine(string line)
        {
            // first, count number of items
            var items = 1;
            for (var i = 0; i < line.Length; i++)
                if (line[i] == ' ') items++;

            //allocate memory and parse items
            var all = new long[items];
            var n = 0;
            var index = 0;
            while (index < line.Length)
            {
                var next = line.IndexOf(' ', index);
                if (next < 0) next = line.Length;
                if (next > index)
                {
                    var v = (long)(parseDouble(line, line.IndexOf(':', index) + 1, next - 1) * 1000);
                    if (v < 0) v = -1;
                    all[n++] = v;

                }
                index = next + 1;
            }

            return all;
        }

        private readonly static double[] pow10Cache;
        static Program()
        {
            pow10Cache = new double[309];

            double p = 1.0;
            for (int i = 0; i < 309; i++)
            {
                pow10Cache[i] = p;
                p /= 10;
            }
        }

        static double parseDouble(string input, int from, int to)
        {
            long inputLength = to - from + 1;
            long digitValue = long.MaxValue;
            long output1 = 0;
            long output2 = 0;
            long sign = 1;
            double multiBy = 0.0;
            int k;

            //integer part
            for (k = 0; k < inputLength; ++k)
            {
                digitValue = input[k + from] - 48; // '0'

                if (digitValue >= 0 && digitValue <= 9)
                {
                    output1 = digitValue + (output1 * 10);
                }
                else if (k == 0 && digitValue == -3 /* '-' */)
                {
                    sign = -1;
                }
                else if (digitValue == -2 /* '.' */ || digitValue == -4 /* ',' */)
                {
                    break;
                }
                else
                {
                    return double.NaN;
                }
            }

            //decimal part
            if (digitValue == -2 /* '.' */ || digitValue == -4 /* ',' */)
            {
                multiBy = pow10Cache[inputLength - (++k)];

                for (; k < inputLength; ++k)
                {
                    digitValue = input[k + from] - 48; // '0'

                    if (digitValue >= 0 && digitValue <= 9)
                    {
                        output2 = digitValue + (output2 * 10);
                    }
                    else
                    {
                        return Double.NaN;
                    }
                }

                multiBy *= output2;
            }

            return sign * (output1 + multiBy);
        }
    }
}
Anton Burtsev
  • Another fix is possible. Just add <gcServer enabled="true"/> to app.config. It uses a number of memory blocks to avoid locks (something like ThreadStatic). However, the memory allocation overhead is still about 50-60% compared to the solution without string.Split(). – Anton Burtsev Jun 21 '17 at 12:06
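The config element in the comment above appears to have been eaten by the page's HTML rendering; given the description of per-thread memory blocks, the server GC switch is the likely candidate. A minimal app.config sketch under that assumption:

<!-- Assumed reconstruction: server GC gives each core its own
     allocation heap, removing the allocation-lock contention
     described in this answer. -->
<configuration>
  <runtime>
    <gcServer enabled="true"/>
  </runtime>
</configuration>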

I have ~100 text files 200MB each and I need to parse them.

The fastest way to read or write data from/to a spinning disk is sequentially in order to minimize the time the disk heads need to seek to find data or write it to the specified location. So doing parallel IO to a single disk is going to slow IO rates down - and depending on the actual IO pattern it can slow rates down dramatically. A disk that can handle 100 MB/sec sequentially might only be able to move 20 or 30 kilobytes per second doing parallel reads/writes of small blocks of data.

Were I optimizing such a process, I wouldn't worry about CPU utilization first, I'd optimize IO throughput first. You are IO bound unless you're doing some really CPU-intensive parsing. Once your IO throughput is optimized, if you're getting 100% CPU utilization then you're CPU bound. If your design scales nicely, then you can add CPUs and probably run faster.

To speed up your IO, you first need to minimize disk seeks, especially if you're using consumer-grade, cheap SATA drives. There are multiple ways to do this.

First, the easiest - eliminate the disk heads. Put your data on SSDs. Problem solved without having to write complex, bug-prone optimized code. How much time will it take for you to make this run faster using software? You have to design something, test it, tune it, debug it, and importantly, keep it running and running well. None of that is free. One important cost is the opportunity cost of spending time making things go faster - when you're doing that, you're not solving any other problems. Faster hardware has none of those costs. In this case, buy the SSDs, plug them in, and you're faster.

But if you really want to spend several weeks or longer optimizing your processing software, here's how I'd go about it:

  1. Spread the data over multiple disks. You can't do parallel IO to physical disks quickly as the disk head seeks will kill performance. So do as much of the reading and writing to different disks as possible.
  2. For each disk, have a single reader or writer thread or process that feeds data to a worker pool or writes data provided by that worker pool.
  3. A tunable number of worker threads/processes to do the actual parsing.

That way, you can read the files and write output data sequentially, without contention on each disk from other IO processes. A rough sketch of that pipeline is shown below.
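As a concrete illustration (not this answer's code), here is a minimal C# sketch of points 2 and 3 for a single disk: one sequential reader feeds a bounded queue, and a tunable pool of workers parses the lines. ProcessSequentially and workerCount are illustrative names; ParseLine is the parser from the question.

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Would live alongside ParseLine in the Program class.
static void ProcessSequentially(string folder, int workerCount)
{
    // Bounded queue applies back-pressure so the reader cannot
    // outrun the parsers and exhaust memory.
    using (var queue = new BlockingCollection<string>(boundedCapacity: 10000))
    {
        // Single reader per disk: strictly sequential IO.
        var reader = Task.Run(() =>
        {
            foreach (var file in Directory.EnumerateFiles(folder))
                foreach (var line in File.ReadLines(file))
                    queue.Add(line);
            queue.CompleteAdding();
        });

        // Tunable pool of CPU-bound workers.
        var workers = Enumerable.Range(0, workerCount)
            .Select(_ => Task.Run(() =>
            {
                foreach (var line in queue.GetConsumingEnumerable())
                    ParseLine(line); // parser from the question
            }))
            .ToArray();

        reader.Wait();
        Task.WaitAll(workers);
    }
}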

Andrew Henle
  • Yeah. It may make sense to make a multi-step process: the first step reads in the file - single-threaded - then hands the content off to the multi-threaded parsing. – TomTom Jun 02 '17 at 10:43

I would consider replacing ForEach with Parallel.ForEach and removing your explicit use of threads. If you need to cap the number of threads, use https://stackoverflow.com/a/5512363/34092.

static void LoadAll(string folder, Action<string> algorithm)
{
    Parallel.ForEach(Directory.EnumerateFiles(folder), algorithm);
}
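If a cap is wanted, the linked answer boils down to passing ParallelOptions; a minimal sketch, with 12 taken from the question's logical-core count:

static void LoadAll(string folder, Action<string> algorithm)
{
    // MaxDegreeOfParallelism caps concurrent iterations; without it,
    // the TPL chooses the degree of parallelism itself.
    Parallel.ForEach(
        Directory.EnumerateFiles(folder),
        new ParallelOptions { MaxDegreeOfParallelism = 12 },
        algorithm);
}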
mjwills
  • This won't help with all that string splitting. No reason to set a limit either; Parallel.ForEach does that – Panagiotis Kanavos Jun 02 '17 at 08:35
  • I tested the code locally and hit 99% CPU usage (up from 30-40% with the existing code). Admittedly, I did it with some test files, so perhaps that won't be exactly what Anton will see (and more importantly, whether it is actually faster than the process based approach). – mjwills Jun 02 '17 at 09:00
  • rewrote with Parallel - same problem – Anton Burtsev Jun 02 '17 at 09:19
  • if you could try with my real files - would be great to know your results – Anton Burtsev Jun 02 '17 at 09:30
  • The process based approach is always bad. Setting up processes - particularly .NET - is not exactly super fast. It must be worth it to be efficient. – TomTom Jun 02 '17 at 10:44
  • @AntonBurtsev Once the files download I'll have a go. I take it you are running on SSDs? – mjwills Jun 02 '17 at 10:48

As others have stated, IO will probably be the bottleneck in the end, and getting 100% CPU usage is really irrelevant. I feel they are missing something, though: you do get higher throughput with processes than with threads, which means IO is not the only bottleneck. The reason is that the CPU runs at a higher frequency with processes, and you want it to run at peak speed when it is not waiting for IO! So, how can you do that?

The easiest way is to set the power profile from the power options manually. Edit power options and set both minimum and maximum processor state to 100%. That should do the job.
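This can also be scripted rather than clicked through. A sketch that shells out to powercfg using its documented setting aliases (see `powercfg /aliases`); it pins the minimum and maximum processor state to 100% on the active plan and needs an elevated process:

using System.Diagnostics;

static void PinProcessorStateTo100()
{
    foreach (var setting in new[] { "PROCTHROTTLEMIN", "PROCTHROTTLEMAX" })
        Process.Start("powercfg",
            $"/setacvalueindex SCHEME_CURRENT SUB_PROCESSOR {setting} 100")
            .WaitForExit();
    // Re-apply the current scheme so the new values take effect.
    Process.Start("powercfg", "/setactive SCHEME_CURRENT").WaitForExit();
}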

If you want to do it from your program, have a look at How to Disable Dynamic Frequency Scaling?. There is probably a similar API for .NET without using native code, but I couldn't find it now.

ewramner