The title is somewhat obscure. Here is the explanation:

I have a two-thread model. One thread increments a variable inside a busy loop. The other reads the counter (t1), does the measurement, reads the counter again (t2), and stores the difference in an array for later printing.

Why are you not using rdtscp? It is serializing and it is already built in the hardware as an instruction.

Well, rdtscp is not good enough for my measurements. I need a 1-2 cycle resolution for my case.

Here is pseudo-code for what I have done, followed by my problem:

    #define ITER 1000

    void* counter_thread(void *input){
        uint64_t* p_counter = (uint64_t *)input;
        set_affinity();

        while(1)
            (*p_counter)++;
    }

    int main(){
        set_affinity();
        warmup();
        uint64_t measurements[ITER]; // for storing information
        register uint64_t t1,t2;
        for(int i = 0; i < ITER; i++){

            mfence();
            t1 = counter;
            // for now, it is empty
            mfence();
            t2 = counter;
            measurements[i] = t2 - t1;
        }

        printf("measurements\n");
        for(int i = ITER - 20; i < ITER; i++){
            printf("%d:%ld\n",i, measurements[i]);
        }
    }

With this, two consecutive reads differ by 9-10. That is good, but I need better accuracy than this.

For now, though, my problem is not about getting better accuracy. My problem is what happens if I change the code to this:

    int main(){
        // all same as above        
        printf("measurements\n");    
        for(int i = ITER - 20; i < ITER; i++){
            printf("%d:%ld\n",i, measurements[i]);
        }
        printf("measurements\n");    
        for(int i = ITER - 20; i < ITER; i++){
            printf("%d:%ld\n",i, measurements[i]);
        }
        printf("measurements\n");    
        for(int i = ITER - 20; i < ITER; i++){
            printf("%d:%ld\n",i, measurements[i]);
        }
    }

Now the difference between the two reads is 50-60. Why is this the case?

I have disabled ASLR to make sure that both versions are placed at the same, or at least very close, physical addresses and hit the same caches. I am also running on an isolated core (isolated from other user processes via the isolcpus kernel parameter, set through GRUB) to get rid of any noise.
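A quick way to check that pinning really took effect is to compare sched_getcpu() with the CPU that was requested. Below is a minimal sketch, assuming Linux with glibc. The helper name pin_to_first_allowed_cpu is hypothetical, and it picks the first allowed CPU only so that it runs on any machine (the post itself uses cores 6 and 7):

```c
#define _GNU_SOURCE
#include <sched.h>   // cpu_set_t, CPU_* macros, sched_getaffinity, sched_setaffinity, sched_getcpu

// Pin the calling thread to the first CPU it is currently allowed to run on.
// Returns that CPU number, or -1 on failure.
// Assumption: in the real experiment the target would be the isolated core;
// "first allowed CPU" is used here only to keep the sketch portable.
static int pin_to_first_allowed_cpu(void) {
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &allowed))
            continue;
        cpu_set_t one;
        CPU_ZERO(&one);
        CPU_SET(cpu, &one);
        // pid 0 means "the calling thread"
        if (sched_setaffinity(0, sizeof(one), &one) != 0)
            return -1;
        return cpu;
    }
    return -1;
}
```

After a successful call, sched_getcpu() should report the returned CPU. If it does not, the affinity call is not doing what the experiment assumes.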

I have checked the assembly of both versions. They look almost the same:

Assembly outputs for given sections

    // counter
    0000000000000aaf <J1>:
    aaf:    48 83 00 01             addq   $0x1,(%rax)
    ab3:    eb fa                   jmp    aaf <J1>
    ab5:    90                      nop
    // ...
    // measurement
    c34:    eb 3e                   jmp    c74 <main+0x185>
    c36:    0f ae f0                mfence 
    c39:    48 8b 9d 38 ff ff ff    mov    -0xc8(%rbp),%rbx
    c40:    0f ae f0                mfence 
    c43:    4c 8b a5 38 ff ff ff    mov    -0xc8(%rbp),%r12
    c4a:    8b 85 30 ff ff ff       mov    -0xd0(%rbp),%eax
    c50:    48 98                   cltq   
    // some instructions for storing, which are identical in both cases
    c7b:    42 0f 00 
    c7e:    7e b6                   jle    c36 <main+0x147>

Full code in C

As a full example, here is my code. Feel free to use it and share your experiences. I am stuck at this point: I do not know why the two versions give different results.


    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h> //cpu_set_t , CPU_SET
    #include <unistd.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <pthread.h> //pthread_t
    #include <errno.h> // EINVAL
    #include <string.h>

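    // Full memory fence. The post does not show this definition; this one is assumed.
    #define mfence() asm volatile("mfence" ::: "memory")

    #define ITER 1000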
    // I have isolated cores 6 and 7 on my computer.
    // They are siblings: two logical cores on the same physical core.
    #define COUNTER_THREAD 6    
    #define MEASUREMENT_THREAD 7


    void* counter_thread(void *input){

        uint64_t* p_counter = (uint64_t *)input;
        // set affinity
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(COUNTER_THREAD, &cpuset);

        if(pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset)){
            fprintf(stderr, "Error while setting affinity on counter thread\n");
        }

        // busy loop, same as:
        // while(1) (*p_counter)++;
        // I am using assembly because the C version (while(1) ...)
        // might add other instructions between the addition and the jump.
        // I have also tried copy-pasting the addq instruction so that
        // there are fewer jumps and more adds, but that also changes
        // the measurement from 10 to 50. So, whenever I touch the code,
        // I get much worse accuracy.
        asm volatile(
            "J1:\n"
            "addq $1, %0\n"
            "jmp J1\n"
            :"+m"(*p_counter)::);

    }

    void warmup(){
        for(volatile int i = 0; i < 10000; i++){}
    }

    int main(){

        uint64_t counter = 0;
        uint64_t* p_counter = &counter;
        pthread_t ctr_tr;
        pthread_create(&ctr_tr, NULL, counter_thread, (void*)p_counter);

        uint64_t* measurements = malloc(sizeof(uint64_t) * (ITER +1));


        // init self thread
        cpu_set_t cpu;
        CPU_ZERO(&cpu);
        CPU_SET(MEASUREMENT_THREAD, &cpu);
        if(pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpu)){
            fprintf(stderr, "Could not assign affinity measurement thread\n");
        }

        warmup();
        register uint64_t t1,t2;
        for(int i = 0; i < ITER; i ++){

            mfence();
            t1 = counter;

            mfence();
            t2 = counter;
            measurements[i] = t2 - t1;
        }
        // print only last 20
        printf("measurements\n");    
        for(int i = ITER - 20; i < ITER; i++){
            printf("%d:%ld\n",i, measurements[i]);
        }
        for(int i = ITER - 20; i < ITER; i++){
            printf("%d:%ld\n",i, measurements[i]);
        }
        for(int i = ITER - 20; i < ITER; i++){
            printf("%d:%ld\n",i, measurements[i]);
        }
        // used below to kill busy loop, don't know if it still works.
        pthread_cancel(ctr_tr);
        free(measurements);
        return 0;
    }

Compilation

I am compiling using: gcc main.c -O0 -pthread -o main

Bonus error

I also have a segmentation fault when main exits. It is related to some allocation that I cannot find. It does not affect the execution, and it is not my main problem at the moment.
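One possible cause of the crash, which is an assumption and not something confirmed here: with the default deferred cancellation type, pthread_cancel() only takes effect at a cancellation point, and the busy loop has none. The counter thread may then still be writing to counter, which lives on main's stack, while main is returning. The sketch below shows one way to rule that out: keep the counter in static storage, enable asynchronous cancellation in the counter thread, and join it before leaving. The function name run_and_stop_counter is hypothetical.

```c
#include <pthread.h>
#include <sched.h>
#include <stdint.h>

// Static storage: the counter stays valid even while main() is returning.
static volatile uint64_t counter;

static void *counter_thread(void *arg) {
    (void)arg;
    // Asynchronous cancellation lets pthread_cancel() stop a loop
    // that contains no cancellation point.
    pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL);
    for (;;)
        counter++;
    return NULL; // not reached
}

// Start the counter, wait until it has ticked at least once,
// then cancel the thread and wait for it to finish.
// Returns 1 if the thread ended by cancellation, 0 otherwise.
static int run_and_stop_counter(void) {
    pthread_t tid;
    void *result = NULL;

    if (pthread_create(&tid, NULL, counter_thread, NULL) != 0)
        return 0;
    while (counter == 0)
        sched_yield();              // wait for the first increment
    if (pthread_cancel(tid) != 0)
        return 0;
    if (pthread_join(tid, &result) != 0)
        return 0;
    return result == PTHREAD_CANCELED;
}
```

The loop here is plain C rather than the inline assembly from the post, so the increment rate may differ. The point is only the shutdown sequence: cancel, then join, then return from main.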

rdtscp results

FYI, I also used rdtscp. On my machine it gives a difference of 108-144 for just this loop:

        for(int i = 0; i < ITER; i ++){

            mfence();
            asm volatile(
            "rdtscp"
            :"=a"(t1)::"rcx","rdx");

            mfence();
            asm volatile(
            "rdtscp"
            :"=a"(t2)::"rcx","rdx");
            measurements[i] = t2 - t1;
        }
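For comparison, here is a sketch of the same back-to-back read using the __rdtscp intrinsic from x86intrin.h (GCC and Clang on x86 only). The intrinsic returns the full 64-bit value from EDX:EAX, whereas the "=a"-only constraint in the loop above keeps just the low 32 bits. The helper names read_tsc and back_to_back_tsc_delta are hypothetical. Note also that on many recent x86 processors the TSC ticks at a fixed reference rate, so its units are not necessarily core cycles.

```c
#include <stdint.h>
#include <x86intrin.h>  // __rdtscp (GCC/Clang, x86 only)

// Read the full 64-bit time-stamp counter.
static inline uint64_t read_tsc(void) {
    unsigned int aux;   // receives IA32_TSC_AUX; unused in this sketch
    return __rdtscp(&aux);
}

// Difference between two back-to-back timestamp reads, in TSC ticks.
static uint64_t back_to_back_tsc_delta(void) {
    uint64_t t1 = read_tsc();
    uint64_t t2 = read_tsc();
    return t2 - t1;
}
```

Calling back_to_back_tsc_delta() in a loop and storing the results gives the same kind of array as measurements in the code above.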

Summary:

I have a two-thread model. One thread runs an infinite loop that increments a variable. The other reads this value (t1), does a job (for now, it does nothing), reads the value again (t2), and stores the difference. I get a difference of ~10 between the two reads. The problem is what happens after the loop is done, when I print the values: if I add more lines of code after the measurement loop, my measurements are messed up (instead of 10, I get 50).

Bonus experiment: If I add an array access between t1 and t2 (as a pseudo measurement function), the difference between the two reads increases to 11 or 12. That is what I want: it adds one more instruction (the array access), so I can say that the fixed offset between two reads is 10 cycles and evaluate other experiments relative to it. However, I am stuck because adding more instructions after the measurement changes the measurement itself. I need to fix this first.
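The offset subtraction described above can be sketched as follows. Using the median as the summary statistic is an assumption made here to damp occasional outliers; the helper names median_u64 and cost_above_baseline are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

// Median of n samples (upper median for even n).
// Sorts a copy, so the caller's array is left untouched.
static uint64_t median_u64(const uint64_t *samples, size_t n) {
    assert(n > 0);
    uint64_t *tmp = malloc(n * sizeof *tmp);
    assert(tmp != NULL);
    memcpy(tmp, samples, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_u64);
    uint64_t m = tmp[n / 2];
    free(tmp);
    return m;
}

// Estimated cost of the measured work: median with the work in place
// minus the median of the empty measurement (the fixed offset).
// Clamped at 0 so noise cannot produce a wrapped-around value.
static uint64_t cost_above_baseline(const uint64_t *empty, size_t n_empty,
                                    const uint64_t *with_work, size_t n_work) {
    uint64_t base = median_u64(empty, n_empty);
    uint64_t work = median_u64(with_work, n_work);
    return work > base ? work - base : 0;
}
```

With empty samples around 10 and array-access samples around 12, this would report a cost of about 2 counter ticks for the array access.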

EDIT

So, instead of compiling with -O0, I compiled with -Os, and it seems that I always get 10 cycles between two measurements. My guess is that it is about alignment, because -Os moves the main() function above the other functions. Still, I cannot find a convincing explanation, and even if alignment is the cause, I do not have an answer to my initial question. If I add another print line at the end of main, the rest of the function stays at the same virtual address, but I still do not get a good result.
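One way to test the alignment guess directly, as an experiment and not a confirmed explanation, is to move the two reads into their own function and force its entry point onto a fixed boundary with GCC's aligned function attribute. Code added elsewhere then cannot shift it. In this sketch, measure_once, entry_offset and the 64-byte boundary are all assumptions made for illustration.

```c
#include <stdint.h>

#define CODE_ALIGN 64  // assumed cache-line size; adjust for the target CPU

// Hypothetical stand-in for the body of the measurement loop.
// noinline keeps it a separate function, so the alignment applies to
// this code and not to whatever function it would be inlined into.
__attribute__((noinline, aligned(CODE_ALIGN)))
uint64_t measure_once(const volatile uint64_t *counter) {
    uint64_t t1 = *counter;   // first read of the shared counter
    uint64_t t2 = *counter;   // second read
    return t2 - t1;
}

// Offset of the function's entry point within a CODE_ALIGN-byte block.
// 0 means the function starts exactly on the boundary.
static uintptr_t entry_offset(void) {
    return (uintptr_t)&measure_once % CODE_ALIGN;
}
```

If the 10 versus 50 difference persists with the function pinned to the same alignment in both builds, code alignment of the measurement loop is probably not the whole story.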
