I'm trying to optimize the performance of reading and writing a double
to shared memory. I have one program writing to shared memory, and another reading from it.
I've used this post to help isolate CPUs for these two programs to run on, adding the following line to my /etc/default/grub file:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=1 isolcpus=6,7"
I am using taskset -c 6 writer
and taskset -c 7 reader
to pin these programs to those CPUs.
Using this man page on sched_setscheduler, I have set up both programs to have the highest scheduling priority using the following code:
struct sched_param param;
param.sched_priority = sched_get_priority_max(SCHED_FIFO);
if (sched_setscheduler(0, SCHED_FIFO, &param) == -1)
{
perror("sched_setscheduler failed");
exit(-1);
}
I have defined a struct to be used in shared memory that contains the required synchronization tools, as well as a timespec struct and a double to pass between the two programs, as follows:
typedef struct
{
// Synchronization objects
pthread_mutex_t ipc_mutex;
sem_t ipc_sem;
// Shared data
double value;
volatile int read_cond;
volatile int end_cond;
double start_time;
struct timespec ts;
} shared_data_t;
Shared Memory Initialization:
Writer:
// ftok to generate unique key
key_t key = ftok("shmfile",65);
// shmget sizes the segment (at least sizeof(shared_data_t)) and returns an identifier in shmid
int shmid = shmget(key, sizeof(shared_data_t), 0666|IPC_CREAT);
// shmat to attach to shared memory
shared_data_t* sdata = (shared_data_t*) shmat(shmid,(void*)0,0);
sdata->value = 0;
Reader:
// ftok to generate unique key
key_t key = ftok("shmfile",65);
// shmget sizes the segment (at least sizeof(shared_data_t)) and returns an identifier in shmid
int shmid = shmget(key, sizeof(shared_data_t), 0666|IPC_CREAT);
// shmat to attach to shared memory
shared_data_t* sdata = (shared_data_t*) shmat(shmid,(void*)0,0);
Initialization of Synchronization Tools in Writer
pthread_mutexattr_t mutex_attr;
pthread_mutexattr_init(&mutex_attr);
pthread_mutexattr_setpshared(&mutex_attr, PTHREAD_PROCESS_SHARED);
pthread_mutex_init(&sdata->ipc_mutex, &mutex_attr);
sem_init(&sdata->ipc_sem, 1, 0);
Write Code
for (int i = 0; i < 20000000; ++i)
{
pthread_mutex_lock(&sdata->ipc_mutex);
sdata->value++;
clock_gettime(CLOCK_MONOTONIC, &sdata->ts);
sdata->start_time = (BILLION*sdata->ts.tv_sec) + sdata->ts.tv_nsec;
sdata->read_cond = 1;
pthread_mutex_unlock(&sdata->ipc_mutex);
sem_wait(&sdata->ipc_sem); // block until the reader has consumed this value
}
fprintf(stderr, "done writing\n" );
pthread_mutex_lock(&sdata->ipc_mutex);
sdata->end_cond = 1;
pthread_mutex_unlock(&sdata->ipc_mutex);
Read Code
double counter = 0;
double total_time = 0;
double max_time = 0;
double min_time = BILLION;
double max_thresh = 1000;
int above_max_counter = 0;
double last_val = 0;
while (1) {
pthread_mutex_lock(&sdata->ipc_mutex);
while (!sdata->read_cond && !sdata->end_cond) {
// spin: briefly release the lock so the writer can update the data
pthread_mutex_unlock(&sdata->ipc_mutex);
pthread_mutex_lock(&sdata->ipc_mutex);
}
clock_gettime(CLOCK_MONOTONIC, &sdata->ts);
double time_to_read = (BILLION*sdata->ts.tv_sec) + sdata->ts.tv_nsec - sdata->start_time;
if (sdata->end_cond) {
break;
}
if (sdata->value != last_val + 1) {
fprintf(stderr, "synchronization error: val: %g, last val: %g\n", sdata->value, last_val);
}
last_val = sdata->value;
if (time_to_read > max_time) {
max_time = time_to_read;
printf("max time: %lf, counter: %ld\n", max_time, (long int) counter);
}
if (time_to_read < min_time) min_time = time_to_read;
if (time_to_read > max_thresh) above_max_counter++;
total_time += time_to_read;
counter++;
sdata->read_cond = 0;
sem_post(&sdata->ipc_sem);
pthread_mutex_unlock(&sdata->ipc_mutex);
}
fprintf(stderr, "avg time to read: %g\n", total_time / counter);
fprintf(stderr, "max time to read: %g\n", max_time);
fprintf(stderr, "min time to read: %g\n", min_time);
fprintf(stderr, "count above max threshold of %g ns: %d\n", max_thresh, above_max_counter);
Cleanup in Writer
//detach from shared memory
shmdt(sdata);
Cleanup in Reader
pthread_mutex_unlock(&sdata->ipc_mutex); // the read loop breaks while still holding the mutex
pthread_mutex_destroy(&sdata->ipc_mutex);
//detach from shared memory
shmdt(sdata);
// destroy the shared memory
shmctl(shmid,IPC_RMID,NULL);
The goal is to minimize the time between the write and the corresponding read. Ideally, I would like to guarantee that the value is read less than 1 microsecond after it is written. However, the output I get:
max time: 5852.000000, counter: 0
max time: 18769.000000, counter: 30839
max time: 27416.000000, counter: 66632
max time: 28668.000000, counter: 1820109
max time: 121362.000000, counter: 1853346
done writing
avg time to read: 277.959
max time to read: 121362
min time to read: 60
count above max threshold of 1000 ns: 1871
indicates that a number of reads (~0.01%) take longer than 1 us, and can take as long as 121 us.
My question is as follows:
What could be causing these spikes, since I have set the priority to highest and isolated the CPU on which these programs are running?
I have learned from this post that I should not expect clock_gettime to have nanosecond accuracy. Are these spikes simply inaccuracies in clock_gettime?
The other option I considered is that cores 6 and 7 are being interrupted somehow, despite the two processes running on them at the highest priority.
Any help would be greatly appreciated.
EDIT
Per comment below, here is the contents of my /proc/interrupts
file:
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
0: 20 0 0 0 0 0 0 0 IO-APIC 2-edge timer
1: 2 0 0 0 0 0 0 0 IO-APIC 1-edge i8042
8: 1 0 0 0 0 0 0 0 IO-APIC 8-edge rtc0
9: 0 0 0 0 0 0 0 0 IO-APIC 9-fasteoi acpi
12: 2 0 0 0 1 1 0 0 IO-APIC 12-edge i8042
16: 0 0 0 0 0 0 0 0 IO-APIC 16-fasteoi i801_smbus, pcim_das1602_16
19: 2 0 0 0 8 10 6 2 IO-APIC 19-fasteoi
120: 0 0 0 0 0 0 0 0 PCI-MSI 16384-edge aerdrv
121: 99 406 0 0 14 5960 6 0 PCI-MSI 327680-edge xhci_hcd
122: 8726 133 47 28 4126 3910 22638 795 PCI-MSI 376832-edge ahci[0000:00:17.0]
123: 2 0 0 0 2 0 3 3663 PCI-MSI 520192-edge eno1
124: 3411 0 2 1 176 24498 77 11 PCI-MSI 32768-edge i915
125: 45 0 0 0 3 6 0 0 PCI-MSI 360448-edge mei_me
126: 432 0 0 0 144 913 28 1 PCI-MSI 514048-edge snd_hda_intel:card0
NMI: 1 1 1 1 1 1 1 1 Non-maskable interrupts
LOC: 12702 10338 10247 10515 9969 10386 16658 13568 Local timer interrupts
SPU: 0 0 0 0 0 0 0 0 Spurious interrupts
PMI: 1 1 1 1 1 1 1 1 Performance monitoring interrupts
IWI: 0 0 0 0 0 0 0 0 IRQ work interrupts
RTR: 7 0 0 0 0 0 0 0 APIC ICR read retries
RES: 4060 2253 1026 708 595 846 887 751 Rescheduling interrupts
CAL: 11906 10423 11418 9894 14562 11000 21479 11223 Function call interrupts
TLB: 10620 8996 10060 8674 13172 9622 20121 9838 TLB shootdowns
TRM: 0 0 0 0 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 0 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 0 0 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 0 0 0 0 0 Machine check exceptions
MCP: 2 2 2 2 2 2 2 2 Machine check polls
ERR: 0
MIS: 0
PIN: 0 0 0 0 0 0 0 0 Posted-interrupt notification event
PIW: 0 0 0 0 0 0 0 0 Posted-interrupt wakeup event
I've tried changing the SMP affinity for interrupts 122 and 123 to cores 0 and 1, per this post, which appears to do nothing: after a reboot, these affinities are set back to cores 6 and 7, respectively. Even without rebooting, simply re-running my programs produces no change in the number of interrupts serviced by these CPU cores.