
My question is very similar to this one here, but since it was asked 8 years ago, I was wondering if there are any better ways now. Also, I am new to Linux programming, so please be gentle!

My scenario is as below -

[Diagram: Process 1 (3 threads, Core 1) and Process 2 (3 threads, Core 2) communicating through a shared memory queue]

The operations performed are -

  1. Processes 1 and 2 are each created with 3 threads. The threads in Process 2 are put to sleep immediately.
  2. Process 1's threads are mapped to Core 1, Process 2's threads are mapped to Core 2.
  3. A thread in Process 1 enqueues a message in the shared memory queue and notifies Process 2 (currently via futexes).
  4. A thread in Process 2 wakes up to dequeue the message.

As is obvious from the use of futexes, I am concentrating on thread wakeup latency. My target is to get the time taken by steps 3 and 4 below 3 microseconds.
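
For reference, the futex notification in steps 3 and 4 works roughly like the minimal sketch below. This is illustrative only, not the exact benchmarked code: the shared 32-bit `seq` counter and the helper names are made up here, the real queue logic and most error handling are omitted, and the counter is assumed to live in memory mapped into both processes.

```c
#include <errno.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper: there is no glibc futex() function, so use syscall(2). */
static long futex(_Atomic uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Process 1, step 3: publish a message, then wake at most one sleeping thread.
   Plain FUTEX_WAKE (no FUTEX_PRIVATE_FLAG) so it works across processes. */
void notify_consumer(_Atomic uint32_t *seq)
{
    atomic_fetch_add(seq, 1);      /* "a message has been enqueued" */
    futex(seq, FUTEX_WAKE, 1);
}

/* Process 2, step 4: sleep until the sequence number changes. */
void await_message(_Atomic uint32_t *seq, uint32_t last_seen)
{
    while (atomic_load(seq) == last_seen) {
        /* Sleeps only if *seq still equals last_seen; EAGAIN/EINTR just mean
           "re-check the counter and possibly wait again". */
        if (futex(seq, FUTEX_WAIT, last_seen) == -1 &&
            errno != EAGAIN && errno != EINTR)
            break;
    }
}
```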

Following are the results I have been able to get for steps 3 & 4, using different IPC synchronization mechanisms -

POSIX mutexes + condition variables - avg 14.14 microseconds

Pipes - avg 28.54 microseconds

Futexes - avg 12.48 microseconds
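
For context, the POSIX mutex + condition variable variant above uses the usual process-shared setup: both objects live in the shared memory region and are initialized with the `PTHREAD_PROCESS_SHARED` attribute. A minimal sketch (illustrative only, not the benchmarked code; the struct and function names are made up):

```c
#include <pthread.h>

/* Control block placed at the start of the shared memory region. */
struct shared_queue_ctl {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    unsigned int    count;      /* number of queued messages */
};

/* Run once, by whichever process creates the shared region. */
void ctl_init(struct shared_queue_ctl *ctl)
{
    pthread_mutexattr_t ma;
    pthread_condattr_t  ca;

    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&ctl->lock, &ma);
    pthread_mutexattr_destroy(&ma);

    pthread_condattr_init(&ca);
    pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
    pthread_cond_init(&ctl->nonempty, &ca);
    pthread_condattr_destroy(&ca);

    ctl->count = 0;
}

/* Process 1, step 3: signal after enqueuing. */
void notify_nonempty(struct shared_queue_ctl *ctl)
{
    pthread_mutex_lock(&ctl->lock);
    ctl->count++;
    pthread_cond_signal(&ctl->nonempty);
    pthread_mutex_unlock(&ctl->lock);
}

/* Process 2, step 4: block until a message is available. */
void wait_nonempty(struct shared_queue_ctl *ctl)
{
    pthread_mutex_lock(&ctl->lock);
    while (ctl->count == 0)
        pthread_cond_wait(&ctl->nonempty, &ctl->lock);
    ctl->count--;
    pthread_mutex_unlock(&ctl->lock);
}
```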

I know these results are very subjective and will vary from machine to machine. What I need are suggestions on how I can make steps 3 and 4 even faster.

So far I have looked into -

  • futexes
  • mutexes
  • condition variables
  • semaphores
  • spinlocks (cannot use them because the threads in Process 2 are already sleeping)

Please help!!

Assume that the architecture cannot be changed, i.e., there have to be two processes with threads. Everything else can be changed.

  • How and *why* do you put process 2 threads to sleep? How big are your messages? Given your apparent performance requirements, every little step you can eliminate is probably important. For example, if you attach shared memory to the same virtual address in each process, you can just pass a pointer instead of the actual data. I'd also recommend experimenting with pinning your processes to different CPUs - on the same/different physical chips, with and without hyperthreading, or maybe even pinning them to the exact same CPU. – Andrew Henle Dec 03 '18 at 23:54
  • You might also want to look into something like [IPC message queues](https://stackoverflow.com/questions/22661614/message-queue-in-c-implementing-2-way-comm) and Unix sockets to send messages. The nice thing about those is that you don't have to use a separate message notification, removing some steps in your process. – Andrew Henle Dec 03 '18 at 23:59
  • Also, you didn't post performance numbers on semaphores. – Andrew Henle Dec 04 '18 at 00:00
  • @AndrewHenle Thank you for your response. Since I use futexes in my example, I use the `FUTEX_WAIT` operation, which sleeps as long as the futex word still holds the expected value, until it is woken by a `FUTEX_WAKE`. It works sort of like `pthread_cond_wait`. [docs](http://man7.org/linux/man-pages/man2/futex.2.html) – r10a Dec 04 '18 at 01:33
  • @AndrewHenle I did not know we can pin processes to cores. I'll look into that. Thanks! I haven't posted semaphore numbers because semaphores are much heavier than mutexes, so I already know the performance is going to be worse. – r10a Dec 04 '18 at 01:34
  • @AndrewHenle Also, I referred to [this](https://dl2.pushbulletusercontent.com/30k5suAkggBzLPkWplHnusJz1iw4aa8d/2012-usenix-ipc-draft1.pdf) paper to check the performance of IPC mechanisms, so I know Unix sockets are already slower than futexes for my purposes. – r10a Dec 04 '18 at 01:35
  • On my system (Core i5 7200U laptop), I measure about 7µs median for Unix domain datagram socket round-trip between two processes, and about 0.7µs median for a semaphore pair round-trip. – Nominal Animal Dec 04 '18 at 12:46
  • @NominalAnimal Woah! Any chance I can take a look at your code? – r10a Dec 04 '18 at 20:54
  • After checking my code, I noticed a bug that skewed the timings for the semaphore case. [Here](https://pastebin.com/Yg2zaA4v) (`gcc -Wall -O2 semaphore.c -lpthread -lrt -o sem-test`) is the test using semaphores on anonymous shared memory. When the CPU frequency ramps up on i5 7200U laptop (HP EliteBook 840 G4), and the machine is otherwise quiescent, the roundtrip time settles down to less than 1100 ns (1.1µs per round-trip). – Nominal Animal Dec 05 '18 at 04:03
  • [Here](https://pastebin.com/trhb0MLH) is the test using separate measure and responder processes using POSIX shared memory (`shm_open()`, `mmap()`). This has more variance in the measurements, but does also get under 1100 ns per round-trip on a quiescent machine. (You can modify the output to display the fraction of round-trips that fulfill your timing criterion instead.) – Nominal Animal Dec 05 '18 at 04:11
  • In general, you should not try to implement such a design (requiring interprocess communication with tight deadlines) on non-realtime systems, because any kind of CPU load will blow your latencies (round-trip times, effectively) to millisecond range. To get such tight latencies, you need to keep data, requests, and responses local, and avoid context switches (between processes, and between processes and the kernel). – Nominal Animal Dec 05 '18 at 04:14
  • While my tests above show you **can** obtain microsecond-range round-trip times using semaphores, they are **not a guarantee** that such round-trip times are achievable in practice. Both test programs are basically optimal-situation tests, with lots of repeated measurements, which lets the kernel adjust to the workload. Whether such tight latencies are achievable in practice, on practical servers, using separate processes at all, I don't know; but I do not think so. – Nominal Animal Dec 05 '18 at 04:17
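
For anyone comparing approaches, below is a minimal sketch of the process-shared semaphore setup described in the comments above. It is **not** the linked pastebin code: the `/latency-test` name and the struct layout are made up, error handling is omitted, and it only shows the ping/pong structure (build with `-lpthread -lrt`, as in the comments).

```c
#include <fcntl.h>
#include <semaphore.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* One pair of semaphores shared between the two processes. */
struct pingpong {
    sem_t request;   /* posted by the measuring process */
    sem_t response;  /* posted by the responder process */
};

/* Map (creating if necessary) the POSIX shared memory object. */
static struct pingpong *pingpong_open(void)
{
    int fd = shm_open("/latency-test", O_RDWR | O_CREAT, 0600);
    ftruncate(fd, sizeof(struct pingpong));
    return mmap(NULL, sizeof(struct pingpong),
                PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/* Run once, in whichever process creates the region. */
static void pingpong_init(struct pingpong *pp)
{
    sem_init(&pp->request, 1, 0);    /* pshared = 1: cross-process */
    sem_init(&pp->response, 1, 0);
}

/* Responder process: wake up on a request, reply immediately. */
static void responder(struct pingpong *pp)
{
    for (;;) {
        sem_wait(&pp->request);
        sem_post(&pp->response);
    }
}

/* Measuring process: one round-trip = post request, wait for response. */
static void one_roundtrip(struct pingpong *pp)
{
    sem_post(&pp->request);
    sem_wait(&pp->response);
}
```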
