11

My application contains several latency-critical threads that "spin", i.e. never block. Such a thread is expected to take 100% of one CPU core. However, it seems modern operating systems often move threads from one core to another. So, for example, with this Windows code:

void Processor::ConnectionThread()
{
    while (work)
    {
        Iterate();
    }
}

I do not see a "100% occupied" core in Task Manager; the overall system load is 36-40%.

But if I change it to this:

void Processor::ConnectionThread()
{
    SetThreadAffinityMask(GetCurrentThread(), 2);
    while (work)
    {
        Iterate();
    }
}

Then I do see that one of the CPU cores is 100% occupied, and the overall system load is reduced to 34-36%.

Does that mean I should tend to use SetThreadAffinityMask for "spin" threads? Did I improve latency by adding SetThreadAffinityMask in this case? What else should I do for "spin" threads to improve latency?

I'm in the middle of porting my application to Linux, so this question is mostly about Linux, if that matters.
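For the Linux port, I assume the rough equivalent is glibc's non-portable pthread_setaffinity_np; a minimal sketch of what I have in mind (the CPU index is just an example):

// Sketch of the Linux version (untested assumption):
// pthread_setaffinity_np is a non-portable glibc extension and needs
// _GNU_SOURCE, which g++ defines by default.
#include <pthread.h>
#include <sched.h>

void Processor::ConnectionThread()
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);                                       // pin to CPU 1
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    while (work)
    {
        Iterate();
    }
}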

upd: found a slide showing that binding a busy-waiting thread to a CPU may help.


Oleg Vazhnev
  • FWIW, a busy thread only migrates to a different core when the thread scheduler runs. Depending on your OS, that typically happens in the range of every 10-15 milliseconds. 10 milliseconds is an eon on modern CPUs. – 500 - Internal Server Error Sep 19 '14 at 12:32
  • Windows tries not to overheat the cores by moving heavy duty threads around. Better not bind the threads to the cores without an actual concrete and compelling reason. – Dialecticus Sep 19 '14 at 12:34
  • Better latency is an actual, concrete and compelling reason – Oleg Vazhnev Sep 19 '14 at 12:36
  • Good enough latency is compelling. Just better is not compelling enough, if you already have good enough. – Dialecticus Sep 19 '14 at 12:40
  • Did you measure a decrease in latency? – eerorika Sep 19 '14 at 12:43
  • No, I don't know how to measure latency; that's why I'm asking this question. As I understand it, with thread affinity the latency should be better, but it would be nice to know how to check this :) – Oleg Vazhnev Sep 19 '14 at 12:49
  • It is not a given that overall latency will be better. The effect of reduced choice in selecting which set of threads to run on the available cores may make a complex, multithreaded system slower overall. – Martin James Sep 19 '14 at 14:40
  • @MartinJames That's why I'm asking the question – to learn how to implement this so as to get the best latency. – Oleg Vazhnev Sep 19 '14 at 14:53
  • It might be beneficial to share what kind of application it actually is. Typically, for a "real time" application you are better off with a dedicated embedded microcontroller, which you can use for the latency-critical stuff and, if needed, interface to a PC "control" application. – dtech Sep 21 '14 at 19:39
  • @ddriver The original application was running on Windows, which doesn't have any realtime capabilities, so I'm guessing it doesn't have any realtime requirements. If there are realtime requirements, though, supporting them would generally come at the cost of overall latency. Using a realtime operating system (rtlinux, vxworks, qnx, etc...) and realtime priorities will reduce jitter though. – Jason Sep 22 '14 at 03:23
  • @Jason I'm more interested in stock RHEL; I don't plan to use a real-time OS. – Oleg Vazhnev Sep 22 '14 at 08:01
  • What is this thread actually doing? – David Schwartz Sep 23 '14 at 18:59
  • What is the hard latency limit you can accept in the thread? The minimum time for a 10 Gbit network card to receive a new packet is around 1000ns plus driver time, which should be around 1000-5000ns depending on how hot the driver's cache is. – Surt Sep 23 '14 at 21:04
  • I'm writing HFT software, so I want things to happen as fast as possible. But, at least for now, I don't plan to use a real-time OS, only stock RHEL + a 10 Gb Solarflare or Mellanox network card. – Oleg Vazhnev Sep 24 '14 at 05:13
  • So every ns could cost lots of $. Your flow will be GetDataFromNet, Decide(Buy/Sell/Do nothing), SendDataToNet. Performance measurement will tell you if all 3 should be done on the same core, or split into producers and consumers. You cannot make a setup that uses 100% of the CPU, as that will pause some threads for longer times, ~10ms on stock RHEL AFAIR. Leave 1-2 cores totally free for incidental runs of the OS. – Surt Sep 24 '14 at 10:38
  • Of course I'm not going to spin on / use all available cores. I plan to use a single 10-core processor, so it's OK to leave 2 cores for the OS; the remaining 8 I can use for my work. – Oleg Vazhnev Sep 24 '14 at 11:08
  • @javapowered Just a bit of architectural advice, but usually you want to optimize the algorithms before you start digging down to optimizing operating-system interaction. Memory allocation is also huge; avoid the heap where possible. Memory is also one of the biggest bottlenecks, so optimize for efficient cache usage and don't share data across threads unless necessary. Writing to disk will utterly kill performance. You also want to avoid system calls, as those trigger context switches which cost roughly 2us on modern chips. Source: I write algorithms for non-displayed markets. – Jason Sep 24 '14 at 23:45
  • Also, RTLinux and real-time priority threads generally won't reduce latency over stock linux (especially if the cpus aren't under contention). It will minimize jitter though. You may see rare spikes in latency with stock linux, but the average latency will still be lower. – Jason Sep 24 '14 at 23:54
  • I prefer better average latency, rare spikes are ok. – Oleg Vazhnev Sep 25 '14 at 08:44
  • @javapowered If your strategy is highly sensitive to latency, you may also want to `nice` or `renice` your process as root (someone mentioned the windows api methods below). – Jason Sep 25 '14 at 23:10
  • @javapowered I added a decent link on intel processor topology enumeration which you may want to take a look at. – Jason Sep 27 '14 at 05:40
  • @Jason Thanks, how am I supposed to use it? Why is it important? – Oleg Vazhnev Sep 28 '14 at 17:06
  • @javapowered The topology is the information the scheduler uses to try to make scheduling decisions (cpu migrations, etc...). For example, @Surt mentioned SMT and its performance impact. The processor topology tells you which cpus are SMT peers. It also tells you which cpus share levels of cache, which are generally faster to use for inter-thread communication. The most important thing is still to measure first before optimizing, though. So for example, if you start profiling with `perf` or by using `rdtsc` and you see a lot of cache misses or high delays, it may make sense to pin threads. – Jason Sep 28 '14 at 17:35

5 Answers

6

Pinning a task to a specific processor will generally give better performance for that task. But there are a lot of nuances and costs to consider when doing so.

When you force affinity, you restrict the operating system's scheduling choices and increase CPU contention for the remaining tasks. So EVERYTHING else on the system is impacted, including the operating system itself. You also need to consider that if tasks need to communicate through memory, and their affinities are set to CPUs that don't share a cache, you can drastically increase the latency of communication between tasks.

One of the biggest reasons setting task CPU affinity is beneficial, though, is that it gives more predictable cache and TLB (translation lookaside buffer) behavior. When a task switches CPUs, the operating system can switch it to a CPU that doesn't have access to the last CPU's cache or TLB. This can increase cache misses for the task. It's particularly an issue when communicating across tasks, as it takes more time to communicate through higher-level caches and, at worst, main memory. To measure cache statistics (and performance in general) on Linux, I recommend using perf.

The best suggestion is really to measure before you try to fix affinities. A good way to quantify latency would be by using the rdtsc instruction (at least on x86). This reads the cpu's time source, which will generally give the highest precision. Measuring across events will give roughly nanosecond accuracy.

#include <stdint.h>

// Read the CPU's time-stamp counter (x86/x86-64, GCC/Clang inline asm).
static inline uint64_t rdtsc() {
   uint32_t eax, edx;
   asm volatile ("rdtsc" : "=a"(eax), "=d"(edx));
   return ((uint64_t) edx << 32) | (uint64_t) eax;
}
  • note – the rdtsc instruction needs to be combined with a load fence (lfence) to ensure all previous instructions have completed, or use rdtscp instead
  • also note – if rdtsc is used without an invariant time source (on Linux, check `grep constant_tsc /proc/cpuinfo`), you may get unreliable values across frequency changes and if the task switches CPU (and therefore time source)
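To make that concrete, here is a minimal usage sketch (added for illustration, not part of the original answer). It uses the compiler's `__rdtscp` intrinsic (GCC/Clang, `<x86intrin.h>`), which waits for earlier instructions to complete and also reports which CPU produced the reading; `Iterate` here is just a hypothetical stand-in for the work being timed.

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>                    // __rdtscp (GCC/Clang, x86/x86-64)

static void Iterate() { /* hypothetical work being measured */ }

int main()
{
    unsigned int cpu_begin, cpu_end;
    uint64_t t0 = __rdtscp(&cpu_begin);   // also records which CPU we ran on
    Iterate();
    uint64_t t1 = __rdtscp(&cpu_end);

    if (cpu_begin != cpu_end)
        printf("warning: thread migrated between CPUs, delta may be unreliable\n");
    printf("elapsed: %llu TSC ticks\n", (unsigned long long)(t1 - t0));
    return 0;
}

With enough samples you can then compare min/avg/max/stddev for the pinned and unpinned variants.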

So, in general, yes, setting the affinity does give lower latency, but this is not always true, and there are very serious costs when you do it.

Some additional reading...

Jason
  • Thanks. So I should call `rdtsc` in the producer thread, then call `rdtsc` in the consumer thread and calculate the difference? Then I should try binding the consumer thread to a certain core, or unbinding it, and compare which is better? Do you have a complete example of how `rdtsc` should be used? Also, are there any external Linux tools which can be used to verify how good my program's latency is – perhaps measuring the number of context switches, or their duration, or something like that? – Oleg Vazhnev Sep 22 '14 at 08:10
  • @javapowered No problem. You can calculate the deltas between `rdtsc` (you'd probably want `rdtscp` across threads), but you might add extra latency by sharing the variable (especially if it crosses a cache line boundary). It's usually more accurate to calculate timings per thread. You would probably want to make the function `inline` as well. With linux, you can use [perf](https://code.google.com/p/kernel/wiki/PerfUserGuide) to get statistics on everything from context switches to page faults to cache misses. If you have `perf` installed, try `perf list` to see what you can record. – Jason Sep 22 '14 at 15:44
  • As for specifically making sure latency stays the same or improves, I would try to measure using `rdtsc` across both versions first. With enough data points, you can compare the statistics (min, max, avg, stddev). If the new version has worse latency, then you may want to try to diagnose using `perf`. If you see a lot of cache misses or cpu migrations you may want to explicitly set affinities to processors which share caches (`pthread_setaffinity_np` or `sched_setaffinity`). The linux scheduler is usually very good at this though. – Jason Sep 22 '14 at 17:01
  • I added a couple more useful links. Preshing's site has a lot of excellent information on effective multi-processor programming (avoid `volatile`). Memory allocation is also generally a big bottleneck, so it's a good idea to understand the underlying heap mechanism. – Jason Sep 27 '14 at 05:20
  • Is `rdtsc` better than the C++11 high-resolution timer? http://stackoverflow.com/a/5524138/93647 – Oleg Vazhnev Mar 17 '15 at 08:34
  • @javapowered A lot of libraries (probably most C++11/14 implementations) will actually implement timer calls in terms of, or directly as, `rdtsc` or `rdtscp`. 74ns could be more than one instruction, but it depends on CPU and frequency. There are always tradeoffs between code complexity and accuracy/performance. You can usually use `objdump` with POSIX OSes to see what the compiler actually emits. – Jason Mar 24 '15 at 21:59
6

Running a thread locked to a single core gives the best latency for that thread in most circumstances, if latency is the most important thing in your code.

The reasons (R) are:

  • your code is likely to be in your iCache
  • the branch predictors are tuned to your code
  • your data is likely to be ready in your dCache
  • the TLB points to your code and data.

Unless

  • You're running on an SMT system (e.g. hyperthreaded), in which case the evil twin will "help" you by washing your code out of the iCache, tuning the branch predictors to its code, pushing your data out of the dCache, and competing for your TLB entries.
    • Cost unknown; each cache miss costs roughly ~4ns, ~15ns or ~75ns for data depending on the level it has to go to, and this quickly runs up to several 1000ns.
    • You still save something for each reason R mentioned above, since those still apply.
    • If the evil twin also just spins, the costs should be much lower.
  • Or you're allowing interrupts on your core, in which case you get the same problems and
    • your TLB is flushed
    • you take a 1000ns-20000ns hit on the context switch; most should be at the low end if the drivers are well programmed.
  • Or you allow the OS to switch your process out, in which case you have the same problems as the interrupt, just at the high end of the range.
    • Switching out could also cause the thread to pause for one or more entire time slices, as it can only run on one (or two) hardware threads.
  • Or you use any system calls that cause context switches.
    • No disk I/O at all.
    • Only asynchronous I/O otherwise.
  • Having more active (non-paused) threads than cores increases the likelihood of problems.

So if you need less than 100ns latency to keep your application from exploding, you need to prevent or lessen the impact of SMT, interrupts and task switching on your core. The perfect solution would be a real-time operating system with static scheduling. This is a nearly perfect match for your target, but it's a new world if you have mostly done server and desktop programming.

The disadvantages of locking a thread to a single core are:

  • It will cost some total throughput.
    • since threads that might otherwise have run there cannot, because the context cannot be switched onto that core.
    • but the latency is more important in this case.
  • If the thread gets context-switched out, it will take some time before it can be scheduled again, potentially one or more time slices, typically 10-16ms, which is unacceptable in this application.
    • Locking it to a core and its SMT sibling will lessen this problem, but not eliminate it. Each added core will lessen the problem.
    • Setting its priority higher will lessen the problem, but not eliminate it.
    • Scheduling with SCHED_FIFO and the highest priority will prevent most context switches; interrupts can still cause temporary switches, as do some system calls.
    • If you have a multi-CPU setup you might be able to take exclusive ownership of one of the CPUs through cpuset. This prevents other applications from using it.

Using pthread_setschedparam with SCHED_FIFO and the highest priority, running as root (SU), and locking the thread to the core and its evil twin should secure the best latency of all of these; only a real-time operating system can eliminate all context switches.
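A minimal sketch of that combination (my illustration, not code from the original answer), assuming Linux with glibc; the core index and the choice of the maximum priority are just examples, and SCHED_FIFO requires root or CAP_SYS_NICE:

#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one core and switch it to SCHED_FIFO at the
// highest priority. Returns false if either step fails (e.g. not root).
static bool make_latency_critical(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return false;

    sched_param sp{};
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp) == 0;
}

The spinning thread would call this once before entering its loop; with SCHED_FIFO the thread is no longer preempted by normal tasks, so a runaway spin can starve everything else on that core.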

Other links:

Discussion on interrupts.

Your Linux might accept that you call sched_setscheduler using SCHED_FIFO, but this demands that you have your own PID, not just a TID, or that your threads are cooperatively multitasking.
This might not be ideal, as all your threads would then only be switched "voluntarily", which removes the kernel's flexibility to schedule them.

Interprocess communication in 100ns

Surt
  • How can I tell Linux not to use the core for anything except my thread? Disable interrupts and everything else? – Oleg Vazhnev Sep 23 '14 at 16:53
  • /dev/cpuset is a pseudo filesystem where you can set up the cpu sharing, one of the options is cpu_exclusive but that only works on cpu level, not core level. And sched_setscheduler can set your process in SCHED_FIFO, but then you will have to do your own scheduling. – Surt Sep 23 '14 at 18:04
  • That's a good point, SMTs generally share *everything* but registers. The linux scheduler is usually aware of SMT CPUs though (I'd guess RHEL kernel, not sure about windows). I still think `perf` is the best tool to diagnose those issues first. Maybe a link to the rtlinux wiki might be helpful? – Jason Sep 23 '14 at 19:17
  • Added info about pthread_setschedparam, setting scheduling on a pthread basis. – Surt Sep 24 '14 at 09:28
  • Both answers are very good, but I can give the bounty to only one; I upvoted your answer though :) – Oleg Vazhnev Sep 27 '14 at 05:16
  • I'm using RHEL 7.1 now and still don't understand what I should do. I don't want to do my own scheduling, so should I not use sched_setscheduler? But then what should I do? Just bind the thread to a core somehow using some C++ setaffinity function and that's it? – Oleg Vazhnev Mar 17 '15 at 15:32
2

I came across this question because I'm dealing with exactly the same design problem. I'm building HFT systems where every nanosecond counts. After reading all the answers, I decided to implement and benchmark 4 different approaches:

  • busy wait with no affinity set
  • busy wait with affinity set
  • observer pattern
  • signals

The unbeatable winner was "busy wait with affinity set". No doubt about it.

Now, as many have pointed out, make sure to leave a couple of cores free to allow the OS to run freely.

My only concern at this point is whether there is any physical harm to cores that run at 100% for hours.
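For illustration only (this is not the benchmark code from this answer), a minimal sketch of the winning "busy wait with affinity set" variant: a consumer pinned to one core spins on an atomic until the producer publishes a timestamp. It assumes Linux/glibc, uses steady_clock for the delta, and the CPU index is arbitrary.

#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

static std::atomic<int64_t> published{0};           // 0 = nothing published yet

static void pin_to_cpu(std::thread& t, int cpu)     // cpu index is an example
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

static int64_t now_ns()
{
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               std::chrono::steady_clock::now().time_since_epoch()).count();
}

int main()
{
    std::thread consumer([] {
        int64_t t0;
        while ((t0 = published.load(std::memory_order_acquire)) == 0)
            ;                                        // busy wait (spin)
        std::printf("wakeup latency: %lld ns\n",
                    static_cast<long long>(now_ns() - t0));
    });
    pin_to_cpu(consumer, 2);                         // pin shortly after start

    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    published.store(now_ns(), std::memory_order_release);   // producer publishes
    consumer.join();
    return 0;
}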

1

Binding a thread to a specific core is probably not the best way to get the job done. You can do that, and it will not harm a multi-core CPU.

The best way to reduce latency is really to raise the priority of the process and of the polling thread(s). Normally the OS interrupts your threads hundreds of times a second and lets other threads run for a while; your thread may then not run for several milliseconds.

Raising the priority will reduce the effect (but not eliminate it).

Read more about SetThreadPriority and SetProcessPriorityBoost. There are some details in the docs you need to understand.
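A minimal sketch of what that might look like (illustrative only; the specific priority values are example choices, and going all the way to REALTIME_PRIORITY_CLASS can starve the rest of the system):

#include <windows.h>

// Raise the process class and the polling thread's priority.
// HIGH_PRIORITY_CLASS / THREAD_PRIORITY_TIME_CRITICAL are example choices.
void RaisePollingThreadPriority()
{
    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

    // Optionally disable dynamic priority boosts so the priority stays fixed.
    SetProcessPriorityBoost(GetCurrentProcess(), TRUE);
}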

egur
-1

This is simply foolish. All it does is reduce the scheduler's flexibility. Whereas before it could run the thread on whatever core it thought was best, now it can't. Unless the scheduler was written by idiots, it would only move the thread to a different core if it had a good reason to do so.

So you're just saying to the scheduler, "even if you have a really good reason to do this, don't do it anyway". Why would you say that?

David Schwartz
  • The scheduler isn't completely omniscient though, and it has to balance resource usage across tasks. There is a limit to the amount of information it can consider in making scheduling decisions. – Jason Sep 23 '14 at 19:25
  • @Jason Nevertheless, it won't move a task to a different core unless it has a good reason to. It might fail to do something you might want it to do, but it's not going to actively do something it doesn't have to do without a reason. – David Schwartz Sep 23 '14 at 19:44
  • I agree, kernel schedulers are mostly *extremely* good at what they do. An immense amount of research has gone into the linux scheduler. They're not entirely infallible though, which I think is why it's probably best to measure with something like `perf` first. – Jason Sep 23 '14 at 19:53
  • The scheduler will move the thread to a different core if it gets context switched out and then a different core becomes free before its original does. This costs 5000ns-50000ns in addition to the inactive time. – Surt Sep 24 '14 at 10:01
  • @Surt The scheduler *can* move the thread to a different core if a different core becomes free before the original does. It doesn't have to. It can wait for the original core to become free. It will do whichever it thinks is best under the circumstances which should be what you want. Would you prefer to wait for the original core to become idle while another perfectly good core is wasted even when the schedulers thinks this is foolish? – David Schwartz Sep 24 '14 at 10:08
  • No I would like the scheduler to throw out the squatter :) – Surt Sep 24 '14 at 10:39
  • @Surt If there's a squatter, the caches are blown out anyway. There's no need to throw out the squatter if another core is free. The scheduler knows all this stuff and is designed by really smart guys. – David Schwartz Sep 24 '14 at 10:42
  • There is because it costs way more to change core, see [context switch](http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html) – Surt Sep 24 '14 at 10:43
  • @Surt Right, but in this case, you've already paid all those costs. So there's nothing to save. There's no advantage to throwing out the squatter. – David Schwartz Sep 24 '14 at 10:50
  • Not if the OS preempts the squatter immediately. The OP wants the lowest possible latency on his netcard at **any** cost, this means binding the thread to that core and basically making it exclusive to this thread. See the comments under the OP. – Surt Sep 24 '14 at 10:56
  • @Surt I don't agree. The caches are already blown out. You've already paid the costs, and there's an idle core. But in any event, it doesn't matter. The scheduler, we hope, knows which of us is right. – David Schwartz Sep 24 '14 at 11:00