
I'm trying to optimize the performance of reading and writing a double to shared memory. I have one program writing to shared memory, and another reading from it.

I've used this post to help isolate CPUs for these two programs to run on, with the following line in my /etc/default/grub file:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=1 isolcpus=6,7"

I am using taskset -c 6 writer and taskset -c 7 reader to pin these programs to those CPUs.
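
To double-check the pinning from inside each program, something like this could be used (a sanity check I added for illustration only; it is not on the timing-critical path):

// Illustrative sanity check: print the CPUs this process is allowed to run on.
// With taskset -c 6 (or -c 7) it should report exactly one core.
// Needs _GNU_SOURCE defined before any libc header, plus <sched.h> and <stdio.h>.
static void print_allowed_cpus(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_getaffinity failed");
        return;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set))
            printf("allowed to run on CPU %d\n", cpu);
}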

Following this man page on sched_setscheduler, I have given both programs the highest scheduling priority with the following code:

struct sched_param param;   
param.sched_priority = sched_get_priority_max(SCHED_FIFO);

if(sched_setscheduler(0, SCHED_FIFO, &param) == -1) 
{
     perror("sched_setscheduler failed");
     exit(-1);
}
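
Since sched_setscheduler fails without the right privileges, a minimal check like the following (not part of my timing runs) could confirm that the policy change actually took effect:

// Illustrative check: read back the policy and priority that are actually in effect.
struct sched_param current;
int policy = sched_getscheduler(0);   // 0 = calling process
if (policy == -1 || sched_getparam(0, &current) == -1)
{
     perror("sched_getscheduler/sched_getparam failed");
}
else
{
     printf("policy: %s, priority: %d\n",
            policy == SCHED_FIFO ? "SCHED_FIFO" : "not SCHED_FIFO",
            current.sched_priority);
}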

I have defined a struct to be used in shared memory that contains the required synchronization tools, as well as a timespec struct and a double to pass between the two programs, as follows:

typedef struct
{
    // Synchronization objects
    pthread_mutex_t ipc_mutex;
    sem_t ipc_sem;
    // Shared data
    double value;
    volatile int read_cond;
    volatile int end_cond;
    double start_time;
    struct timespec ts;
} shared_data_t;

Shared Memory Initialization:

Writer:

// ftok to generate unique key 
key_t key = ftok("shmfile",65); 

// shmget returns an identifier in shmid; the segment is sized to hold the struct
int shmid = shmget(key,sizeof(shared_data_t),0666|IPC_CREAT); 

// shmat to attach to shared memory 
shared_data_t* sdata = (shared_data_t*) shmat(shmid,(void*)0,0); 
sdata->value = 0;

Reader:

// ftok to generate unique key 
key_t key = ftok("shmfile",65); 

// shmget returns an identifier in shmid; the segment is sized to hold the struct
int shmid = shmget(key,sizeof(shared_data_t),0666|IPC_CREAT); 

// shmat to attach to shared memory 
shared_data_t* sdata = (shared_data_t*) shmat(shmid,(void*)0,0); 
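
For reference, either program could verify right after attaching that the existing segment is actually large enough for the struct (an illustrative check, not something I rely on for the timing itself); shmctl with IPC_STAT reports the real segment size:

// Illustrative check (needs <sys/shm.h>, already required for shmget/shmat):
// make sure the attached segment can hold a shared_data_t.
struct shmid_ds ds;
if (shmctl(shmid, IPC_STAT, &ds) == -1)
{
     perror("shmctl(IPC_STAT) failed");
}
else if (ds.shm_segsz < sizeof(shared_data_t))
{
     fprintf(stderr, "shared segment too small: %zu < %zu bytes\n",
             (size_t) ds.shm_segsz, sizeof(shared_data_t));
     exit(-1);
}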

Initialization of Synchronization Tools in Writer

pthread_mutexattr_t mutex_attr;
pthread_mutexattr_init(&mutex_attr);
pthread_mutexattr_setpshared(&mutex_attr, PTHREAD_PROCESS_SHARED);
pthread_mutex_init(&sdata->ipc_mutex, &mutex_attr);
sem_init(&sdata->ipc_sem, 1, 0);
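
For completeness, an error-checked variant of the same initialization might look like this (a sketch; the attribute and the process-shared flags are unchanged from the code above):

// Illustrative, error-checked version of the initialization above.
pthread_mutexattr_t mutex_attr;
if (pthread_mutexattr_init(&mutex_attr) != 0 ||
    pthread_mutexattr_setpshared(&mutex_attr, PTHREAD_PROCESS_SHARED) != 0 ||
    pthread_mutex_init(&sdata->ipc_mutex, &mutex_attr) != 0)
{
     fprintf(stderr, "mutex initialization failed\n");
     exit(-1);
}
if (sem_init(&sdata->ipc_sem, 1 /* pshared: usable across processes */, 0) == -1)
{
     perror("sem_init failed");
     exit(-1);
}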

Write Code

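// Per-iteration handshake: update the value and timestamp under the mutex,
// raise read_cond for the reader, then block on the semaphore until the
// reader has consumed the value.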
for (int i = 0; i < 20000000; ++i)
{
    pthread_mutex_lock(&sdata->ipc_mutex);
    sdata->value++;
    clock_gettime(CLOCK_MONOTONIC, &sdata->ts);
    sdata->start_time = (BILLION*sdata->ts.tv_sec) + sdata->ts.tv_nsec;
    sdata->read_cond = 1;
    pthread_mutex_unlock(&sdata->ipc_mutex);
    sem_wait(&sdata->ipc_sem);
}
fprintf(stderr, "done writing\n" );

pthread_mutex_lock(&sdata->ipc_mutex);
sdata->end_cond = 1;
pthread_mutex_unlock(&sdata->ipc_mutex);

Read Code

double counter = 0;
double total_time = 0;
double max_time = 0;
double min_time = BILLION;
double max_thresh = 1000;
int above_max_counter = 0;
double last_val = 0;
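// Busy-wait for read_cond (new data) or end_cond (writer finished), timestamp
// the read, check the value sequence, record latency statistics, and then
// release the writer via the semaphore.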
while (1) {

    pthread_mutex_lock(&sdata->ipc_mutex);
    while (!sdata->read_cond && !sdata->end_cond) {
        pthread_mutex_unlock(&sdata->ipc_mutex);
        pthread_mutex_lock(&sdata->ipc_mutex);
    }

    clock_gettime(CLOCK_MONOTONIC, &sdata->ts);
    double time_to_read = (BILLION*sdata->ts.tv_sec) + sdata->ts.tv_nsec - sdata->start_time;

    if (sdata->end_cond) {
        break;
    }

    if (sdata->value != last_val + 1) {
        fprintf(stderr, "synchronization error: val: %g, last val: %g\n", sdata->value, last_val);
    }
    last_val = sdata->value;

    if (time_to_read > max_time) {
        max_time = time_to_read;
        printf("max time: %lf, counter: %ld\n", max_time, (long int) counter);
    }
    if (time_to_read < min_time) min_time = time_to_read;

    if (time_to_read > max_thresh) above_max_counter++;
    total_time += time_to_read;
    counter++;

    sdata->read_cond = 0;
    sem_post(&sdata->ipc_sem);
    pthread_mutex_unlock(&sdata->ipc_mutex);
}

fprintf(stderr, "avg time to read: %g\n", total_time / counter);
fprintf(stderr, "max time to read: %g\n", max_time);
fprintf(stderr, "min time to read: %g\n", min_time);
fprintf(stderr, "count above max threshhold of %g ns: %d\n", max_thresh, above_max_counter);

Cleanup in Writer

//detach from shared memory 
shmdt(sdata); 

Cleanup in Reader

pthread_mutex_unlock(&sdata->ipc_mutex);
pthread_mutex_destroy(&sdata->ipc_mutex);

//detach from shared memory 
shmdt(sdata); 

// destroy the shared memory 
shmctl(shmid,IPC_RMID,NULL); 
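
For symmetry with sem_init in the writer, the semaphore could also be torn down here (a one-line addition I have not actually included in my cleanup; it would go before shmdt):

// Counterpart to sem_init (illustrative); would precede shmdt(sdata).
sem_destroy(&sdata->ipc_sem);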

The goal is to minimize the time between the write and the subsequent read. Ideally, I would like to guarantee that the value is read less than 1 microsecond after it is written. However, the output I get:

max time: 5852.000000, counter: 0
max time: 18769.000000, counter: 30839
max time: 27416.000000, counter: 66632
max time: 28668.000000, counter: 1820109
max time: 121362.000000, counter: 1853346
done writing
avg time to read: 277.959
max time to read: 121362
min time to read: 60
count above max threshold of 1000 ns: 1871

indicates that a small fraction of reads (~0.01%) exceed 1 us, and that the latency can go as high as 121 us.

My questions are as follows:

What could be causing these spikes, given that I have set the scheduling priority to the highest level and isolated the CPUs on which these programs run?

I have learned from this post that I should not expect clock_gettime to have nanosecond accuracy. Are these spikes simply inaccuracies in clock_gettime?
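
To get a rough bound on how much of this could be clock_gettime itself, a small back-to-back measurement on one of the isolated cores could be run (a sketch; BILLION is assumed to be the same 1E9 constant used in my programs):

#include <stdio.h>
#include <time.h>

#define BILLION 1E9   // assumed to match the definition used in my programs

// Time consecutive clock_gettime calls and report the average and worst-case
// gap in nanoseconds, as a rough estimate of the clock's overhead/jitter.
int main(void)
{
    struct timespec a, b;
    double total = 0, worst = 0;
    const int iters = 1000000;

    for (int i = 0; i < iters; ++i) {
        clock_gettime(CLOCK_MONOTONIC, &a);
        clock_gettime(CLOCK_MONOTONIC, &b);
        double diff = (BILLION * b.tv_sec + b.tv_nsec) -
                      (BILLION * a.tv_sec + a.tv_nsec);
        total += diff;
        if (diff > worst) worst = diff;
    }
    printf("avg gap: %g ns, worst gap: %g ns\n", total / iters, worst);
    return 0;
}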

The other possibility I considered is that these cores (6 and 7) are being interrupted somehow, despite the processes on them running at the highest priority.

Any help would be greatly appreciated.

EDIT

Per the comment below, here are the contents of my /proc/interrupts file:

           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
   0:         20          0          0          0          0          0          0          0   IO-APIC    2-edge      timer
   1:          2          0          0          0          0          0          0          0   IO-APIC    1-edge      i8042
   8:          1          0          0          0          0          0          0          0   IO-APIC    8-edge      rtc0
   9:          0          0          0          0          0          0          0          0   IO-APIC    9-fasteoi   acpi
  12:          2          0          0          0          1          1          0          0   IO-APIC   12-edge      i8042
  16:          0          0          0          0          0          0          0          0   IO-APIC   16-fasteoi   i801_smbus, pcim_das1602_16
  19:          2          0          0          0          8         10          6          2   IO-APIC   19-fasteoi 
 120:          0          0          0          0          0          0          0          0   PCI-MSI 16384-edge      aerdrv
 121:         99        406          0          0         14       5960          6          0   PCI-MSI 327680-edge      xhci_hcd
 122:       8726        133         47         28       4126       3910      22638        795   PCI-MSI 376832-edge      ahci[0000:00:17.0]
 123:          2          0          0          0          2          0          3       3663   PCI-MSI 520192-edge      eno1
 124:       3411          0          2          1        176      24498         77         11   PCI-MSI 32768-edge      i915
 125:         45          0          0          0          3          6          0          0   PCI-MSI 360448-edge      mei_me
 126:        432          0          0          0        144        913         28          1   PCI-MSI 514048-edge      snd_hda_intel:card0
 NMI:          1          1          1          1          1          1          1          1   Non-maskable interrupts
 LOC:      12702      10338      10247      10515       9969      10386      16658      13568   Local timer interrupts
 SPU:          0          0          0          0          0          0          0          0   Spurious interrupts
 PMI:          1          1          1          1          1          1          1          1   Performance monitoring interrupts
 IWI:          0          0          0          0          0          0          0          0   IRQ work interrupts
 RTR:          7          0          0          0          0          0          0          0   APIC ICR read retries
 RES:       4060       2253       1026        708        595        846        887        751   Rescheduling interrupts
 CAL:      11906      10423      11418       9894      14562      11000      21479      11223   Function call interrupts
 TLB:      10620       8996      10060       8674      13172       9622      20121       9838   TLB shootdowns
 TRM:          0          0          0          0          0          0          0          0   Thermal event interrupts
 THR:          0          0          0          0          0          0          0          0   Threshold APIC interrupts
 DFR:          0          0          0          0          0          0          0          0   Deferred Error APIC interrupts
 MCE:          0          0          0          0          0          0          0          0   Machine check exceptions
 MCP:          2          2          2          2          2          2          2          2   Machine check polls
 ERR:          0
 MIS:          0
 PIN:          0          0          0          0          0          0          0          0   Posted-interrupt notification event
 PIW:          0          0          0          0          0          0          0          0   Posted-interrupt wakeup event

I've tried changing the SMP affinity for interrupts 122 and 123 to cores 0 and 1, per this post, which appears to do nothing: after restarting my computer, these affinities are still set to cores 6 and 7, respectively.

Even without restarting, and simply re-running my programs, I see no change in the number of interrupts serviced by these CPU cores.
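
To see what the kernel currently reports for those IRQs, the masks can be read back (a small sketch; IRQ numbers 122 and 123 are the ones from my /proc/interrupts output above):

#include <stdio.h>

// Illustrative: print the current SMP affinity mask of an IRQ as reported in
// /proc, to check whether an earlier write to smp_affinity actually stuck.
static void print_irq_affinity(int irq)
{
    char path[64], mask[64];
    snprintf(path, sizeof path, "/proc/irq/%d/smp_affinity", irq);
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return; }
    if (fgets(mask, sizeof mask, f))
        printf("IRQ %d smp_affinity: %s", irq, mask);
    fclose(f);
}
// e.g. print_irq_affinity(122); print_irq_affinity(123);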

  • See `/proc/interrupts` to find which cores are servicing which interrupts. – stark Feb 19 '19 at 20:43
  • `pthread_mutex_unlock(&sdata->ipc_mutex); pthread_mutex_lock(&sdata->ipc_mutex);` - I, personally, don't think that having this sort of thing in your program is a reasonable path to pursue. What exactly are you trying to achieve here? Why not use a proper `pthread_cond_t` mechanism? – oakad Feb 20 '19 at 01:37
  • @oakad - Waking the sleeping thread through the use of condition variables resulted in a latency that was too high for my requirements, which is why I've implemented a busy-waiting solution here. – aketcham Feb 20 '19 at 02:51
  • Then you have to at least `sched_yield` there - you're not leaving the other process any opportunities to acquire the lock. In general, you should look into implementing a so-called "disruptor" algorithm if your latency requirements are so harsh: http://lmax-exchange.github.io/disruptor/files/Disruptor-1.0.pdf (check out page 8 for the general idea; several C++ ports of this idea also exist). – oakad Feb 20 '19 at 03:42
