The title is somewhat obscure, so here is the explanation.
I have a two-thread model. One thread increments a variable inside a busy loop; the other reads the counter (t1), does the measurement, reads the counter again (t2), and stores the difference in an array for later printing.
Why are you not using rdtscp? It is partially serializing (it waits for all earlier instructions to execute) and it is already built into the hardware as an instruction.
Well, rdtscp is not good enough for my measurements. I need a resolution of 1-2 cycles for my case.
Here is pseudo-code for what I have done, followed by my problem:
void* counter_thread(void *input){
    uint64_t* p_counter = (uint64_t *)input;
    setaffinity();               // pin to the counter core
    while(1)
        (*p_counter)++;
}
int main(){
    setaffinity();               // pin to the measurement core
    warmup();
    uint64_t measurements[ITER]; // for storing information (ITER is 1000)
    register uint64_t t1,t2;
    for(int i = 0; i < ITER; i ++){
        mfence();
        t1 = counter;
        // for now, it is empty
        mfence();
        t2 = counter;
        measurements[i] = t2 - t1;
    }
    printf("measurements\n");
    for(int i = ITER - 20; i < ITER; i++){
        printf("%d:%lu\n",i, measurements[i]);
    }
}
With this, the difference between two consecutive reads is 9-10. That is good, but I need better accuracy than this.
For now, though, my problem is not about getting better accuracy. My problem is that if I change the code to this:
int main(){
    // all same as above
    printf("measurements\n");
    for(int i = ITER - 20; i < ITER; i++){
        printf("%d:%lu\n",i, measurements[i]);
    }
    printf("measurements\n");
    for(int i = ITER - 20; i < ITER; i++){
        printf("%d:%lu\n",i, measurements[i]);
    }
    printf("measurements\n");
    for(int i = ITER - 20; i < ITER; i++){
        printf("%d:%lu\n",i, measurements[i]);
    }
}
This gives a difference of 50-60. Why is this the case?
I have disabled ASLR to make sure that both versions are placed at the same, or at least very close, virtual addresses and hit the same caches. I am also running on an isolated core (isolated from other user processes via the GRUB kernel parameter isolcpus) to get rid of any noise.
I have checked the assembly of both versions. They look almost the same:
Assembly outputs for given sections
// counter
0000000000000aaf <J1>:
aaf: 48 83 00 01 addq $0x1,(%rax)
ab3: eb fa jmp aaf <J1>
ab5: 90 nop
// ...
// measurement
c34: eb 3e jmp c74 <main+0x185>
c36: 0f ae f0 mfence
c39: 48 8b 9d 38 ff ff ff mov -0xc8(%rbp),%rbx
c40: 0f ae f0 mfence
c43: 4c 8b a5 38 ff ff ff mov -0xc8(%rbp),%r12
c4a: 8b 85 30 ff ff ff mov -0xd0(%rbp),%eax
c50: 48 98 cltq
// some instructions for storing, which are identical in both cases
c7b: 42 0f 00
c7e: 7e b6 jle c36 <main+0x147>
Full code in C
As a full example, here is my code. Feel free to use it and share your experiences. I am stuck at this point and do not know why the two versions give different results.
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h> //cpu_set_t , CPU_SET
#include <unistd.h>
#include <stdint.h>
#include <stdlib.h>
#include <pthread.h> //pthread_t
#include <errno.h> // EINVAL
#include <string.h>
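#define ITER 1000
// Assumption: mfence() is not shown in this listing. It is defined here as
// an inline mfence instruction that also acts as a compiler barrier.
#define mfence() asm volatile("mfence" ::: "memory")
// I have isolated cores 6 and 7 on my computer.
// They are siblings: they reside in the same physical core.
#define COUNTER_THREAD 6
#define MEASUREMENT_THREAD 7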
void* counter_thread(void *input){
    uint64_t* p_counter = (uint64_t *)input;
    // set affinity
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(COUNTER_THREAD, &cpuset);
    if(pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset)){
        fprintf(stderr, "Error while setting affinity on counter thread\n");
    }
    // Busy loop, same as:
    //     while(1) (*p_counter)++;
    // I am using assembly because the C version might get other
    // instructions between the addition and the jump.
    // I have also tried copy-pasting the addq instruction so that there
    // are fewer jumps and more adds, but that also changes the
    // measurement from 10 to 50. So whenever I touch the code,
    // I get far less accuracy.
    asm volatile(
        "J1:\n"
        "addq $1, %0\n"
        "jmp J1\n"
        :"+m"(*p_counter)::);
}
void warmup(){
    for(volatile int i = 0; i < 10000; i++){}
}
int main(){
    uint64_t counter = 0;
    uint64_t* p_counter = &counter;
    pthread_t ctr_tr;
    pthread_create(&ctr_tr, NULL, counter_thread, (void*)p_counter);
    uint64_t* measurements = malloc(sizeof(uint64_t) * (ITER +1));
    // init self thread
    cpu_set_t cpu;
    CPU_ZERO(&cpu);
    CPU_SET(MEASUREMENT_THREAD, &cpu);
    if(pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpu)){
        fprintf(stderr, "Could not assign affinity measurement thread\n");
    }
    warmup();
    register uint64_t t1,t2;
    for(int i = 0; i < ITER; i ++){
        mfence();
        t1 = counter;
        mfence();
        t2 = counter;
        measurements[i] = t2 - t1;
    }
    // print only last 20
    printf("measurements\n");
    for(int i = ITER - 20; i < ITER; i++){
        printf("%d:%lu\n",i, measurements[i]);
    }
    for(int i = ITER - 20; i < ITER; i++){
        printf("%d:%lu\n",i, measurements[i]);
    }
    for(int i = ITER - 20; i < ITER; i++){
        printf("%d:%lu\n",i, measurements[i]);
    }
    // Meant to kill the busy loop; I don't know if it still works.
    pthread_cancel(ctr_tr);
    free(measurements);
    return 0;
}
Compilation
I am compiling using:
gcc main.c -O0 -pthread -o main
Bonus error
I also get a segmentation fault when main exits. It is related to some allocation that I cannot find. It does not affect the execution, and it is not my main problem at the moment.
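For reference, here is a minimal sketch of stopping the counter thread cooperatively instead of with pthread_cancel, so that the thread is joined before the stack variable it writes to goes away. The names counter_ctx, counter_loop and run_counter_briefly are hypothetical, and whether this has anything to do with the segmentation fault is untested; it only shows one way to make sure the busy loop has really ended before main returns.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
struct counter_ctx {
    _Atomic uint64_t counter; /* written only by the counter thread */
    atomic_bool stop;         /* set by the main thread to end the loop */
};

static void *counter_loop(void *arg) {
    struct counter_ctx *ctx = arg;
    while (!atomic_load_explicit(&ctx->stop, memory_order_relaxed)) {
        /* Single writer, so a relaxed load + store is enough (no lock prefix). */
        uint64_t v = atomic_load_explicit(&ctx->counter, memory_order_relaxed);
        atomic_store_explicit(&ctx->counter, v + 1, memory_order_relaxed);
    }
    return NULL;
}

/* Start the counter, wait until it has ticked at least once, ask it to stop,
 * join it, and return the final count. Returns 0 if the thread could not be
 * created. */
uint64_t run_counter_briefly(void) {
    struct counter_ctx ctx;
    atomic_init(&ctx.counter, 0);
    atomic_init(&ctx.stop, false);

    pthread_t tid;
    if (pthread_create(&tid, NULL, counter_loop, &ctx) != 0)
        return 0;

    while (atomic_load_explicit(&ctx.counter, memory_order_relaxed) == 0)
        sched_yield();

    atomic_store_explicit(&ctx.stop, true, memory_order_relaxed);
    int rc = pthread_join(tid, NULL); /* ctx stays valid until the thread is gone */
    assert(rc == 0);
    (void)rc;
    return atomic_load_explicit(&ctx.counter, memory_order_relaxed);
}
```

Note that checking the stop flag adds a load and a branch to every iteration, so the counter ticks more coarsely than the two-instruction asm loop. For the real measurement the flag check would have to be weighed against the resolution it costs.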
rdtscp results
FYI, I also used rdtscp. On my machine it gives a difference of 108-144 for just this loop:
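for(int i = 0; i < ITER; i ++){
    mfence();
    // rdtscp puts the low 32 bits of the TSC in eax and the high 32 bits
    // in edx; only the low half is kept here, which is enough for short deltas.
    asm volatile(
        "rdtscp"
        :"=a"(t1)::"rcx","rdx");
    mfence();
    asm volatile(
        "rdtscp"
        :"=a"(t2)::"rcx","rdx");
    measurements[i] = t2 - t1;
}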
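As a point of comparison, here is a minimal sketch of reading the full 64-bit timestamp with rdtscp and estimating the cost of the timer itself from back-to-back reads. It assumes x86-64 and GCC/Clang inline assembly; read_tscp and min_tsc_delta are hypothetical helper names, and the loop deliberately omits the mfence calls, so its numbers are not directly comparable to the 108-144 above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: full 64-bit TSC read (x86-64, GCC/Clang inline asm). */
static inline uint64_t read_tscp(void) {
    uint32_t lo, hi, aux;
    __asm__ volatile("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
    (void)aux; /* IA32_TSC_AUX, unused here */
    return ((uint64_t)hi << 32) | lo;
}

/* Smallest delta over n back-to-back read pairs: a rough estimate of the
 * overhead of the timer itself. */
uint64_t min_tsc_delta(int n) {
    assert(n > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < n; i++) {
        uint64_t t1 = read_tscp();
        uint64_t t2 = read_tscp();
        if (t2 - t1 < best)
            best = t2 - t1;
    }
    return best;
}
```

Taking the minimum rather than the last few samples filters out iterations that were disturbed by interrupts or cache misses.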
Summary:
I have a two-thread model. One thread runs an infinite loop incrementing a variable. The other one reads this value (t1), does a job (for now, it does not do anything), reads the value again (t2), and stores the difference.
I get a difference of ~10 between the two reads.
The problem is that after the loop is done, I print the values. If I add more lines of code after the measurement loop, my measurements are messed up (instead of 10, I get 50).
Bonus experiment:
If I add an array access between t1 and t2 (like a pseudo measurement function), I can see that the difference between the two reads increases to 11 or 12. That is what I want: it adds one more instruction (the array access), so I can say that the offset for two measurements is 10 cycles and calibrate other experiments against it. However, I am stuck because adding more instructions after the measurement changes the measurement itself. I need to fix this first.
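The offset idea can be sketched as follows: take the median of the empty-loop measurements as the baseline and subtract it from the median of the measurements with work between the two reads. The helper names median_u64 and net_ticks are hypothetical, and using the median (rather than the minimum or mean) is an assumption made here to tolerate occasional outliers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Median of n samples (upper median for even n); the input is left untouched. */
uint64_t median_u64(const uint64_t *samples, size_t n) {
    assert(n > 0);
    uint64_t *tmp = malloc(n * sizeof *tmp);
    assert(tmp != NULL);
    memcpy(tmp, samples, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_u64);
    uint64_t m = tmp[n / 2];
    free(tmp);
    return m;
}

/* Ticks attributed to the code under test: median with work minus the
 * empty-loop median, clamped at zero. */
uint64_t net_ticks(const uint64_t *empty, size_t n_empty,
                   const uint64_t *loaded, size_t n_loaded) {
    uint64_t base = median_u64(empty, n_empty);
    uint64_t with = median_u64(loaded, n_loaded);
    return with > base ? with - base : 0;
}
```

With empty-loop samples around 10 and array-access samples around 12, net_ticks would report 2. This only works if the baseline is stable, which is exactly what breaks when the code after the loop changes.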
EDIT
So, instead of compiling with -O0, I have compiled with -Os, and it seems that I always get 10 cycles between 2 measurements. My guess is that it is about alignment, because -Os moves the main() function above the other functions. Still, I cannot find a better explanation, and even if alignment is the cause, I still don't have an answer to my initial question. If I add another print line at the end of main, the rest of the function stays at the same virtual address, but I still don't get a good result.
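One way to check the alignment guess is to look at where the hot loops sit relative to cache-line boundaries. The sketch below assumes 64-byte lines and that the load address is page-aligned, so the offsets in the disassembly listing have the same position within a line at run time; line_offset and lines_spanned are hypothetical helper names. It does not show that alignment is the cause, only how to compare two builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, which is typical for current x86 parts. */
#define CACHE_LINE 64u

/* Offset of an address within its cache line. */
unsigned line_offset(uintptr_t addr) {
    return (unsigned)(addr % CACHE_LINE);
}

/* Number of cache lines touched by the byte range [addr, addr + len). */
size_t lines_spanned(uintptr_t addr, size_t len) {
    assert(len > 0);
    return (size_t)((addr + len - 1) / CACHE_LINE - addr / CACHE_LINE) + 1;
}
```

For the addresses in the listing above, the measurement loop starts at 0xc36, which is 54 bytes into its line, and the range 0xc36 to 0xc7f spans two lines, while the six-byte counter loop at 0xaaf stays within one. Running the same check on the -Os build and on the build with the extra printf loops would show whether those positions actually differ.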