17

Background:

In my application written in C++, I have created 3 threads:

  • AnalysisThread (or Producer) : it reads an input file, parses it, generates patterns, and enqueues them into a `std::queue`¹.
  • PatternIdRequestThread (or Consumer) : it dequeues patterns from the queue and sends them, one by one, to the database through a client (written in C++), which returns a pattern UID that is then assigned to the corresponding pattern.
  • ResultPersistenceThread : it does a few more things, talks to the database, and works as expected as far as CPU usage is concerned.

The first two threads take 60-80% of the CPU between them, about 35% each on average.

Question:

I don't understand why some threads take such high CPU usage.

I analyse it as follows : if it is the OS that makes decisions like context switching, interrupts, and scheduling as to which thread should be given access to system resources such as CPU time, then how come some threads in a process happen to use more CPU than others? It looks as if some threads forcefully take the CPU from the OS at gunpoint, or as if the OS has a real soft spot for some threads and is biased towards them from the very beginning, giving them all the resources it has. Why can't it be impartial and give all of them equal shares?

I know that this is naive. But I get even more confused if I think along this line : the OS gives a thread access to the CPU based on the amount of work the thread has to do, but how can the OS compute or predict the amount of work before executing it completely?

I wonder what the causes of high CPU usage are. How can we identify them? Is it possible to identify them just by looking at the code? What are the tools?

I'm using Visual Studio 2010.

1. I have my doubts about `std::queue` as well. I know that standard containers aren't thread-safe. But if exactly one thread enqueues items to the queue, is it safe for exactly one other thread to dequeue items from it? I imagine it to be like a pipe: on one side you insert data, on the other you remove data, so why would it be unsafe if both are done simultaneously? But that is not the real question in this topic; however, you can add a note in your answer addressing this.

Updates:

I realized that my consumer thread was busy-spinning, which I've fixed with a Sleep of 3 seconds. This fix is temporary, and soon I will use an Event instead. But even with Sleep, the CPU usage has only dropped to 30-40%, and occasionally it goes up to 50%, which doesn't seem desirable from a usability point of view, as the system stops responding to the other applications the user is currently working with.

Is there any way that I can still improve on the high CPU usage? As said earlier, the producer thread (which now uses most of the CPU cycles) reads a file, parses the packets (of some format) in it, and generates patterns out of them. If I used Sleep there too, the CPU usage would decrease, but would that be a good idea? What are the common ways to solve this?

Nawaz
  • 353,942
  • 115
  • 666
  • 851
  • 2
    Your testcode is most likely IO bound, that's why you won't see higher percentages (you flush the buffer after every write with `std::endl`). – Xeo Feb 14 '12 at 10:29
  • @Xeo: Alright. I just replaced them with only `std::pow`. Now it is still the same! – Nawaz Feb 14 '12 at 10:31
  • The pow-loop test takes 25% cpu load in my task manager, just as I would have expected (quad core cpu). – Timbo Feb 14 '12 at 10:32
  • 2
    How are you making the producer/consumer block if the queue is full/empty? Events? Busy spin? – Tudor Feb 14 '12 at 10:33
  • @Xeo: You were right. So I removed that part. – Nawaz Feb 14 '12 at 10:36
  • @Tudor: As of now, I'm using busy spin. But is that causing this much problem? – Nawaz Feb 14 '12 at 10:36
  • 4
    YES. That is most likely the cause. – Tudor Feb 14 '12 at 10:38
  • 4
    As soon as any thread modifies `std::queue` (in any manner), all accesses to it must be protected. Use a condition variable. (It's possible to implement a lock free queue, using atomic variables in the implementation, but that isn't `std::queue`.) – James Kanze Feb 14 '12 at 10:46
  • @JamesKanze: Yes. I'm going to implement that. But that doesn't bother me much, as I would have done it anyway. I'm just too lazy. – Nawaz Feb 14 '12 at 10:51
  • I see that you are now using sleep. Have you tried to use a blocking collection such that the producer is put to sleep if the workers are not processing items fast enough? – Tudor Feb 23 '12 at 20:00
  • @Tudor: I didn't do that. But I know for sure that the other threads do their work much faster than the producer thread. In the application, most of the work is done by the producer thread. The other threads only update the database, and that too, asynchronously. – Nawaz Feb 24 '12 at 01:18
  • Is your queue bounded or not? Does the producer ever need to wait (like queue full) or does it just put things in the queue continuously? – Tudor Feb 24 '12 at 14:32
  • @Tudor: My queue, which provides thread-safety, is a thin wrapper over `std::queue`. So I think it's unbounded. – Nawaz Feb 24 '12 at 14:48
  • @Tudor: My machine has 4 cores (I'm not sure though). But I run my application on various machines; some of them are real machines, others are VMs. The real machines have 2 or 4 cores, and the VMs have 1 core. – Nawaz Feb 24 '12 at 15:02
  • Well, I'm afraid I'm running out of ideas. Any chance you can post a bit of code from the producer thread? – Tudor Feb 25 '12 at 10:01
  • Check out this example on implementing a thread-safe blocking queue: http://www.justsoftwaresolutions.co.uk/threading/implementing-a-thread-safe-queue-using-condition-variables.html – Emile Cormier Feb 26 '12 at 23:45

8 Answers

27

Personally I'd be pretty annoyed if my threads had work to do, and there were idle cores on my machine because the OS wasn't giving them high CPU usage. So I don't really see that there's any problem here [Edit: turns out your busy looping is a problem, but in principle there's nothing wrong with high CPU usage].

The OS/scheduler pretty much doesn't predict the amount of work a thread will do. A thread is (to over-simplify) in one of three states:

  1. blocked waiting for something (sleep, a mutex, I/O, etc)
  2. runnable, but not currently running because other things are
  3. running.

The scheduler will select as many things to run as it has cores (or hyperthreads, whatever), and run each one either until it blocks or until an arbitrary period of time called a "timeslice" expires. Then it will schedule something else if it can.

So, if a thread spends most of its time in computation rather than blocking, and if there's a core free, then it will occupy a lot of CPU time.

There's a lot of detail in how the scheduler chooses what to run, based on things like priority. But the basic idea is that a thread with a lot to do doesn't need to be predicted as compute-heavy, it will just always be available whenever something needs scheduling, and hence will tend to get scheduled.

For your example loop, your code doesn't actually do anything, so you'd need to check how it has been optimized before judging whether 5-7% CPU makes sense. Ideally, a processing-heavy thread running flat out occupies one core's worth of CPU, i.e. about 100%/N on an N-core machine: 50% on two cores, 25% on four, 6.25% on sixteen. So unless you have at least 16 cores, your result is at first sight anomalous (and if you did have 16 cores, then one thread occupying 35% would be even more anomalous!). In a standard desktop OS most cores are idle most of the time, so the higher the proportion of CPU that your actual programs occupy when they run, the better.

On my machine I frequently hit one core's worth of CPU use when I run code that is mostly parsing text.

if exactly one thread enqueues items to the queue, is it safe for exactly one other thread to dequeue items from it?

No, that is not safe for std::queue with a standard container. std::queue is a thin wrapper on top of a sequence container (deque or list); it doesn't add any thread safety. The thread that adds items and the thread that removes items both modify some data in common, for example the size field of the underlying container. You need either some synchronization, or else a safe lock-free queue structure that relies on atomic access to the common data. std::queue has neither.
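As a concrete illustration, here is a minimal sketch of such a synchronized wrapper. This is my code, not part of the original answer, and it uses the C++11 std::mutex and std::condition_variable, which VS2010 predates; boost::mutex and boost::condition_variable are near drop-in equivalents there:

#include <condition_variable>
#include <mutex>
#include <queue>

// Minimal sketch: every access to the underlying std::queue happens
// under the mutex, and the consumer blocks on a condition variable
// instead of busy-spinning.
template <typename T>
class SyncQueue {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(value);
        }
        cv_.notify_one();   // wake the consumer if it is waiting
    }

    // Blocks until an item is available.
    T pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T value = q_.front();
        q_.pop();
        return value;
    }
};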

Steve Jessop
  • 273,490
  • 39
  • 460
  • 699
  • I've posted the updates in my question. Hope you see that. :-) – Nawaz Feb 23 '12 at 17:43
  • @Nawaz: I can't really say what's consuming the CPU. Any chance you could profile and work it out? If it's a 2-core machine and CPU peaks at 50%, that suggests one thread running flat-out, although it's also possible there are two threads passing back and forth with effectively no parallelism. I'm surprised that it takes out the rest of the OS, too, something's definitely up but I don't know what. – Steve Jessop Feb 24 '12 at 09:51
  • I think profiling isn't going to help me here, as my producer code is actually a thin wrapper over the [ICA protocol](http://en.wikipedia.org/wiki/Independent_Computing_Architecture), which is a huge codebase written by Citrix Systems almost 20 years ago. Roughly, I can say that my wrapper code makes up at most 5% of the whole code. Of the remaining 95% (which is Citrix's code), 15-20% is in source code form and the rest is in `.lib` files; these lib files do most of the work. So I don't know how profiling would help me here. – Nawaz Feb 28 '12 at 16:53
  • By the way, I tried using [`THREAD_PRIORITY_LOWEST`](http://msdn.microsoft.com/en-us/library/windows/desktop/ms686277(v=vs.85).aspx) as the thread priority for all threads created by me. Will it improve the situation? – Nawaz Feb 28 '12 at 16:58
  • 2
    @Nawaz: well, often the best form of "profiling" is "stack shots". Interrupt the program in the debugger and see where you are (wherever in the stack trace your code ends and Citrix's code begins). Do that 5 times, anywhere that you end up twice is hot code, work out whether or not that's code you think is reasonable for your program (and that particular thread) to be spending all its time in. – Steve Jessop Feb 28 '12 at 16:59
  • 2
    Low thread priority won't normally have any effect. Your program is doing a lot of "something". Low priority just means that if the OS has anything else to do then it will schedule that instead. That might make your machine more responsive, but it doesn't solve the underlying problem, that your program is doing more "something" than you think it should be. You need to find out what "something" it's doing, and whether it should be doing less of it. You've already knocked out the busy-looping (at least, I assume you have since Tudor told you it was bad), so you need to find the next problem. – Steve Jessop Feb 28 '12 at 17:04
  • @Nawaz: Well, I see the bounty is over. Conclusions? – Tudor Mar 01 '12 at 15:50
  • @Tudor: I'll post it sometime in detail. It's 10:00 PM here; sort of busy with my work, alone in the office; everybody went home. :| – Nawaz Mar 01 '12 at 16:33
  • @Tudor: I've posted the implementation of `concurrent_blocking_queue` at CodeReview. If interested, see it here : http://codereview.stackexchange.com/questions/12052/implementation-of-concurrent-blocking-queue-for-producer-consumer – Nawaz May 25 '12 at 05:56
  • Steve: I've implemented `concurrent_blocking_queue`, and today I posted it at CodeReview. If interested, see it here : http://codereview.stackexchange.com/questions/12052/implementation-of-concurrent-blocking-queue-for-producer-consumer – Nawaz May 25 '12 at 05:58
7

Edit: Ok, since you are using a busy-spin to block on the queue, this is most likely the cause of the high CPU usage. The OS is under the impression that your threads are doing useful work when they are actually not, so they get full CPU time. There was an interesting discussion here: Which one is better for performance to check another threads boolean in java

I advise you to either switch to events or another blocking mechanism, or use a synchronized queue instead, and see how it goes.

Also, the reasoning that the queue is thread-safe "because only two threads are using it" is very dangerous.

Assuming the queue is implemented as a linked list, imagine what can happen when it has only one or two elements remaining. Since you have no way of controlling the relative speeds of the producer and the consumer, this may well be the case, and then you're in big trouble.
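To make the hazard concrete, here is a sketch of a naive linked-list queue (my illustration, not Tudor's code); the numbered comments mark the accesses where an unsynchronized producer and consumer collide once the queue is down to its last element:

// Deliberately NOT thread-safe: a naive singly-linked queue.
struct Node { int value; Node* next; };

struct NaiveQueue {
    Node* head;   // the consumer's end
    Node* tail;   // the producer's end
    NaiveQueue() : head(0), tail(0) {}

    void push(int v) {                 // producer
        Node* n = new Node();
        n->value = v;
        n->next = 0;
        if (tail) tail->next = n;      // (1) writes tail->next
        else      head = n;            // (2) writes head when queue was empty
        tail = n;                      // (3) writes tail
    }

    bool pop(int& out) {               // consumer
        if (!head) return false;       // (4) reads head
        Node* n = head;
        head = n->next;                // (5) writes head
        if (!head) tail = 0;           // (6) writes tail!
        out = n->value;
        delete n;
        return true;
    }
};

With one element left, (6) in the consumer races with (1) and (3) in the producer on tail, and (2) races with (4) and (5) on head: each thread touches BOTH ends of the list, so "one thread per end" offers no protection at all.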

Tudor
  • 61,523
  • 12
  • 102
  • 142
  • I added this line in my question : *But that is not the real question in this topic, however, you can add a note in your answer, addressing this.* – Nawaz Feb 14 '12 at 10:33
  • You're right, I will address it once I get more context, but I thought this was a very big issue and deserved some commenting on. – Tudor Feb 14 '12 at 10:34
  • So, I've made an edit regarding the new context you provided. – Tudor Feb 14 '12 at 10:41
  • I didn't understand why you talked about *synchronized queue* in the context of busy spin. Is *synchronized queue* not about a thread-safe queue? – Nawaz Feb 14 '12 at 12:02
  • @Nawaz: Yes it is, but by this I meant an already-implemented concurrent queue like the one in TBB. – Tudor Feb 14 '12 at 12:35
  • But has it anything to do with high CPU usage? – Nawaz Feb 14 '12 at 13:02
  • It may, in the sense that these concurrent data structures are optimized for concurrent access and make sure both to perform well and not to waste CPU. – Tudor Feb 14 '12 at 13:07
  • 3
    TBB concurrent_bounded_queue has a blocking pop() method that will avoid busy-waiting. Alternatively, you could use non-blocking concurrent_queue with an associated semaphore to count/wait on. – Martin James Feb 14 '12 at 13:35
5

Before you can start thinking about how to make your threads consume less CPU, you need to have an idea of where all that CPU time is spent. One way to obtain this information is with a CPU profiler. If you don't have one, give Very Sleepy a try. It's easy to use, and free.

The CPU profiler will monitor your running application and take note of where the time is spent. As a result it will give you a list of functions sorted by how much CPU they've used during the sampled period, how many times they were called, etc. You then need to look at the profiling results, starting from the most CPU-intensive functions, and see what you can change in them to reduce the CPU usage.

The important thing is that once you have profiler results you have actual data that tells you what parts of your application you can optimize to obtain the biggest return.

Now let's consider the kinds of things you can find that are consuming a lot of CPU.

  • A worker thread is typically implemented as a loop. At the top of the loop a check is made to decide if there is work to do and any available work is executed. A new iteration of the loop begins the cycle again.

    You may find that with a setup like this most of the CPU time allocated to this thread is spent looping and checking, and very little is spent actually doing work. This is the so-called busy-wait problem. To partially address it you can add a sleep between loop iterations, but that isn't the best solution. The ideal way to address this problem is to put the thread to sleep when there is no work to do, and have any other thread that generates work for it send a signal to awaken it. This practically eliminates the looping overhead; the thread will only use CPU when there is work to do. I typically implement this mechanism with semaphores, but on Windows you can also use an Event object. Here is a sketch of an implementation:

    class MyThread {
    private:
        Semaphore sem;   // a counting semaphore, e.g. built on
                         // CreateSemaphore on Windows or sem_t with POSIX

        void thread_function() {
            while (!exit()) {
                if (there_is_work_to_do())
                    do_work();
                go_to_sleep();
            }
        }

        // this is called by the thread function when it
        // doesn't have any more work to do
        void go_to_sleep() {
            sem.wait();
        }

    public:
        // this is called by other threads after they add work to
        // the thread's queue
        void wake_up() {
            sem.signal();
        }
    };
    

    Note that in the above solution the thread function always tries to go to sleep after executing one task. If the thread's queue has more work items, the wait on the semaphore will return immediately, since each time an item was added to the queue the originator must have called the wake_up() function. (A Win32 version of this mechanism is sketched after this list.)

  • The other thing you may see in the profiler output is that most of the CPU is spent in functions executed by the worker thread while it is doing work. This is actually not a bad thing: if most of the time is spent working, that means the thread had work to do and there was CPU time available to do it, so in principle there is nothing wrong here.

    But still, you may not be happy that your application uses so much CPU, so you need to look at ways to optimize your code so that it does the work more efficiently.

    For example, you may find that some little auxiliary function is called millions of times, so while a single run of the function is quick, multiplied by a few million it becomes a bottleneck for the thread. At this point you should look at ways to reduce the CPU usage of this function, either by optimizing its code or by optimizing the caller(s) to call it fewer times.

    So the strategy here is to start from the most expensive function according to the profiling report and try a small optimization. Then you rerun the profiler to see how things changed. You may find that a small change to the most CPU-intensive function moves it down to 2nd or 3rd place, and as a result the overall CPU usage is reduced. After you congratulate yourself on the improvement, you repeat the exercise with the new top function. You can continue this process until you are satisfied that your application is as efficient as it can be.
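As referenced in the first bullet above, here is a hedged Win32 sketch of that same sleep/wake mechanism using a counting semaphore (the class name and structure are mine, not the answer's). The semaphore's count tracks pending wakeups, so a wake_up() issued while the worker is busy is not lost; the next wait returns at once:

#include <windows.h>
#include <climits>

class WorkerWakeup {
    HANDLE sem_;
public:
    WorkerWakeup()
        // initial count 0 (nothing pending), effectively unbounded maximum
        : sem_(CreateSemaphore(NULL, 0, LONG_MAX, NULL)) {}
    ~WorkerWakeup() { CloseHandle(sem_); }

    // called by the worker thread when it has no more work to do
    void go_to_sleep() { WaitForSingleObject(sem_, INFINITE); }

    // called by other threads after adding work to the queue
    void wake_up() { ReleaseSemaphore(sem_, 1, NULL); }
};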

Good luck.

Miguel Grinberg
  • 65,299
  • 14
  • 133
  • 152
3

Although the others have correctly analysed the problem already (as far as I can tell), let me try to add some more detail to the proposed solutions.

Firstly, to summarize the problems:

  1. If you keep your consumer thread busy-spinning in a for-loop or similar, that's a terrible waste of CPU power.
  2. If you use the sleep() function with a fixed number of milliseconds, that's a waste of CPU too (if the amount of time is too low), or it delays the process unnecessarily (if it's too high). There is no way to set the amount of time just right.

What you need to do instead is to use a type of sleep that wakes up just at the right moment, i.e. whenever a new task has been appended to the queue.

I'll explain how to do this using POSIX. I realize that's not ideal when you are on Windows, but to benefit from it you can either use a POSIX library for Windows or use the corresponding functions available in your environment.

Step 1: You need one mutex and one signal:

#include <pthread.h>
pthread_mutex_t *mutex  = new pthread_mutex_t;
pthread_cond_t  *signal = new pthread_cond_t;

/* Initialize the mutex and the signal as below.
   Both functions return an error code. If that
   is not zero, you need to react to it. I will
   skip the details of this. */
pthread_mutex_init(mutex,0);
pthread_cond_init(signal,0);

Step 2: Now inside the consumer thread, wait for the signal to be sent. The idea is that the producer sends the signal whenever it has appended a new task to the queue:

/* Lock the mutex. Again, this might return an error code. */
pthread_mutex_lock(mutex);

/* Wait for the signal. pthread_cond_wait() unlocks the mutex and then
   'immediately' falls asleep; this is what replaces the busy spinning,
   or the fixed-time sleep. The wait must sit in a loop that re-checks
   the actual condition (queue_is_empty() stands for whatever emptiness
   check your queue provides): waits can wake up spuriously, and a
   signal sent before we started waiting would otherwise be lost. */
while (queue_is_empty())
    pthread_cond_wait(signal, mutex);

/* We reach this point holding the mutex, with the queue known to be
   non-empty. Take a task off the queue here, while still holding the
   mutex, then unlock it so other threads (consumer and producer alike)
   can access the queue and the signal. */
pthread_mutex_unlock(mutex);

/* Next, deal with the task just taken from the queue. */

Step 2 above should essentially be placed inside an infinite loop. Make sure there is a way for the process to break out of the loop. For example -- although slightly crude -- you can append a 'special' task to the queue that means 'break out of the loop'.

Step 3: Enable the producer thread to send a signal whenever it has appended a task to the queue:

/* We assume we are now in the producer thread and have a new task to
   append to the queue. */
/* First we lock the mutex. This must be THE SAME mutex object as used
   in the consumer thread. Note that appending to the queue must happen
   while we hold the mutex; otherwise it races with the consumer's
   emptiness check. */
pthread_mutex_lock(mutex);

/* Append the task to the queue here, then send the signal. The argument
   must refer to THE SAME signal object as is used by the consumer. */
pthread_cond_signal(signal);

/* Unlock the mutex so other threads (producers or consumers alike) can
   make use of the queue and the signal. */
pthread_mutex_unlock(mutex);

Step 4: When everything is finished and you shut down your threads, you must destroy the mutex and the signal:

pthread_mutex_destroy(mutex);
pthread_cond_destroy(signal);
delete mutex;
delete signal;

Finally let me re-iterate one thing the others have said already: You must not use an ordinary std::deque for concurrent access. One way of solving this is to declare yet another mutex, lock it before every access to the deque, and unlock it right after.

Edit: A few more words about the producer thread, in light of the comments. As far as I understand it, the producer thread is currently free to add as many tasks to the queue as it can, so it will keep doing that and keep the CPU busy to the extent that it isn't delayed by IO and memory access. Firstly, I don't think of the high CPU usage resulting from this as a problem, but rather as a benefit. However, one serious concern is that the queue will grow indefinitely, potentially causing the process to run out of memory. Hence a useful precaution would be to limit the size of the queue to a reasonable maximum, and have the producer thread pause whenever the queue grows too long.

To implement this, the producer thread would check the length of the queue before adding a new item. If the queue is full, it would put itself to sleep, waiting for a signal sent by a consumer whenever it takes a task off the queue. For this you could use a second signal mechanism, analogous to the one described above.
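For illustration, a hedged sketch of that producer-side check in the same POSIX style as above. The names space_available, queue_length() and MAX_QUEUE are mine: a second pthread_cond_t initialized like the first, your queue's size check, and the chosen bound:

/* Producer: before appending, wait until the queue has room. */
pthread_mutex_lock(mutex);
while (queue_length() >= MAX_QUEUE)
    pthread_cond_wait(space_available, mutex);
/* ...append the task here, still holding the mutex... */
pthread_cond_signal(signal);            /* wake a waiting consumer */
pthread_mutex_unlock(mutex);

/* Consumer: after taking a task off the queue (under the same mutex),
   tell the producer there is room again. */
pthread_cond_signal(space_available);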

jogojapan
  • 68,383
  • 11
  • 101
  • 131
  • As I've said in the updates (in my question), currently my focus is the producer thread, as the consumer thread doesn't take many CPU cycles. I understand very well that the consumer thread should ideally use events/signals, not sleep, and soon I will do that. In this topic, I'm looking for techniques to reduce the CPU usage of the producer thread. – Nawaz Feb 24 '12 at 12:16
  • @Nawaz: As your update (for the bounty) clearly states: "I'm looking for techniques to reduce the CPU usage taken by the consumer thread." It says "consumer", not "producer". Also, what I suggest above affects both threads. – jogojapan Feb 24 '12 at 12:22
  • @Nawaz: One more thing. If you are going to implement signals anyway, I'd suggest that you do that first and check if it solves your problem (possibly in *both* threads). – jogojapan Feb 24 '12 at 12:24
  • Oh that was a typo. I just added a correct statement in my question, as I cannot edit the bounty statement. – Nawaz Feb 24 '12 at 12:26
  • It is very unlikely that signals will solve the CPU usage problem of the **producer** thread. I don't see why they would solve it. But I know that I have to implement the signal mechanism anyway! – Nawaz Feb 24 '12 at 12:27
  • Oh wow. That's confusing. Not your fault, though. But still, I believe carefully implementing this and testing what it does to the performance in both threads is a good idea. – jogojapan Feb 24 '12 at 12:27
  • @Nawaz: Why do you think that using a blocking mechanism like signals or a blocking queue will not solve the problem? Blocked threads are not scheduled by the OS until the event that wakes them up occurs. – Tudor Feb 24 '12 at 12:50
  • @Tudor: How would the signal mechanism block the producer thread? The consumer is supposed to wait for signals, right? The producer can keep going while the consumer waits. – Nawaz Feb 24 '12 at 12:54
  • @Nawaz: That's true. The producer can keep going on, and that is probably why it causes high CPU usage. On the other hand, it should not really go on like this because at some point the queue (or the memory) will be used up. A signal mechanism can help here, too, and as a side effect reduce the CPU usage of the producer thread to just the right amount (in balance with how much the consumers can take). I have edited the answer to describe this. – jogojapan Feb 24 '12 at 17:17
  • @jogojapan: Yes, momentarily the producer will be blocked when the consumer will be removing items from queue. But I think that will not improve the situation much. On Monday, I will implement the signal mechanism, as I've to do it sometime anyway, and then see what happens. Maybe, it comes with surprising results. – Nawaz Feb 24 '12 at 17:28
  • @Nawaz: Why not use something like tbb:concurrent_bounded_queue? It has blocking and thread-safety implemented already: http://threadingbuildingblocks.org/files/documentation/a00129.html#_details – Tudor Feb 24 '12 at 20:21
3

Threads consume resources such as memory. Blocking/unblocking a thread incurs a one-off cost each time. If a thread blocks and unblocks tens of thousands of times per second, this can waste a significant amount of CPU.

However, once a thread is blocked, it doesn't matter how long it stays blocked; there is no ongoing cost. The popular way to find performance problems is to use a profiler.

However, I do this a lot, and my method is this: http://www.wikihow.com/Optimize-Your-Program%27s-Performance

Avinash
  • 790
  • 2
  • 11
  • 26
1

As people have said, the right way to synchronize the hand-off between the producer and consumer threads is to use a condition variable. When the producer wants to add an element to the queue, it locks the mutex associated with the condition variable, adds the element, and notifies waiters on the condition variable. The consumer waits on the same condition variable and, when notified, consumes elements from the queue, then waits again. I'd personally recommend using boost::interprocess for these, but it can be done in a reasonably straightforward way with other APIs too.

Also, one thing to keep in mind is that while conceptually each thread operates on one end of the queue only, most implementations keep a member variable tracking the number of elements (to make count() O(1)), and this shared counter is an opportunity for rare and difficult-to-diagnose concurrency issues.

If you're looking for a way to reduce the CPU usage of the consumer thread (yes, I know this is your real question)... well, it sounds like it's actually doing what it's supposed to now, but the data processing is expensive. If you can analyze what it's doing, there may be opportunities for optimization.

If you want to throttle the producer thread intelligently, it's a little more work: you could have the producer thread add items to the queue until it reaches a certain threshold (say 10 elements), then wait on a different condition variable. When the consumer consumes enough data that the number of queued elements drops below another threshold (say 5 elements), it notifies this second condition variable. If all parts of the system can move the data around quickly, this could still consume a lot of CPU, but it would be spread relatively evenly among them. At this point it's the OS's job to let other, unrelated processes get their fair(ish) share of the CPU.
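To sketch that throttling scheme concretely (my code, not bdow's; the thresholds and names are invented, and it uses C++11 primitives for brevity where the answer suggests Boost):

#include <condition_variable>
#include <mutex>
#include <queue>

static const size_t HIGH = 10;  // producer stalls when the queue hits this
static const size_t LOW  = 5;   // consumer reopens the gate at this level

std::queue<int> q;
std::mutex m;
std::condition_variable not_full, not_empty;

void produce(int item) {
    std::unique_lock<std::mutex> lock(m);
    not_full.wait(lock, [] { return q.size() < HIGH; });
    q.push(item);
    not_empty.notify_one();
}

int consume() {
    std::unique_lock<std::mutex> lock(m);
    not_empty.wait(lock, [] { return !q.empty(); });
    int item = q.front();
    q.pop();
    if (q.size() == LOW)   // crossed the low watermark: wake the producer
        not_full.notify_one();
    return item;
}

The gap between the two thresholds is what spreads the CPU time: the producer sleeps through the drain from 10 down to 5 instead of being woken for every single element.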

bdow
  • 181
  • 4
  • Also, remember that it's important not to hold locks for longer than you need to. Protect against concurrent access to the shared data (the queue), but after adding or removing an element, you don't need to protect any more. – bdow Feb 28 '12 at 16:21
0

Thread CPU usage depends on many factors, but in the main the OS can only assign processing time based on points at which it can interrupt a thread.

If your thread interacts with hardware in any way, this gives the OS a chance to interrupt the thread and assign processing elsewhere, mainly on the assumption that hardware interaction takes time. In your example you're using the iostream library and thus interacting with hardware.

If your loop didn't have this, it would most likely use nearly 100% CPU.

ChrisBD
  • 9,104
  • 3
  • 22
  • 35
  • 1
    I'm pretty sure that Nawaz has an OS with pre-emptive multitasking, which IIRC was introduced to Microsoft's customers with Windows 95, but was in IBM's OS/2 before that. Therefore the OS can interrupt the thread almost anywhere (the exceptions being occasional bits of code deep in kernel mode with interrupts disabled, or in the middle of so-called "atomic" operations). – Steve Jessop Feb 14 '12 at 10:57
0
  1. Use asynchronous (file and socket) I/O to reduce useless CPU waiting time.
  2. Use a vertical threading model to reduce context switches, if possible.
  3. Use lock-free data structures (see the sketch below).
  4. Use a profiling tool, such as VTune, to find the hot spots and optimize them.
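On point 3, here is a hedged sketch of the classic lock-free structure that fits this question's setup exactly: a bounded single-producer/single-consumer ring buffer. This is my code, not the answer's, and it uses C++11 std::atomic, which VS2010 predates:

#include <atomic>
#include <cstddef>

// Safe for exactly one producer thread and one consumer thread.
// N must be a power of two so that the free-running head/tail
// counters stay consistent with % N when they wrap around.
template <typename T, size_t N>
class SpscRing {
    T buf_[N];
    std::atomic<size_t> head_;  // next slot to read  (advanced by consumer)
    std::atomic<size_t> tail_;  // next slot to write (advanced by producer)
public:
    SpscRing() : head_(0), tail_(0) {}

    bool try_push(const T& v) {          // producer only
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N)
            return false;                // full
        buf_[t % N] = v;
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }

    bool try_pop(T& v) {                 // consumer only
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;                // empty
        v = buf_[h % N];
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
};

Note that the callers still need a blocking strategy for the full/empty cases (an event or semaphore, as discussed in the other answers); a bare try_push/try_pop loop would reintroduce the busy-spin.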
BruceAdi
  • 1,949
  • 1
  • 11
  • 8