I'm using Intel Pin to dynamically instrument multi-threaded programs for data race detection. I instrument memory read/write instructions to collect memory traces at runtime and then analyze the log. The trace collection is simple: it stores each memory access (time, thread id, address, etc.) in a buffer at runtime and writes the buffer out at the end.

```cpp
VOID PIN_FAST_ANALYSIS_CALL RecordMemRead(unsigned int ip, unsigned int addr, THREADID tid){
    PIN_GetLock(&lock, tid+1);

    membuf[instCounter].tid = tid;
    membuf[instCounter].ip = ip;
    membuf[instCounter].addr = addr;
    membuf[instCounter].op = 'R';
    instCounter++;

    PIN_ReleaseLock(&lock);
}

VOID PIN_FAST_ANALYSIS_CALL RecordMemWrite(unsigned int ip, unsigned int addr, THREADID tid){
    // similar to RecordMemRead()
}

VOID Instruction(INS ins, VOID *v){
    if (INS_IsBranchOrCall(ins))
        return;
    if (INS_IsStackRead(ins))
        return;
    if (INS_IsStackWrite(ins))
        return;

    if (INS_IsMemoryRead(ins)){
        INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemRead, IARG_FAST_ANALYSIS_CALL, IARG_INST_PTR, IARG_MEMORYREAD_EA,
          IARG_THREAD_ID, IARG_END);
    }
    else if (INS_IsMemoryWrite(ins)){
        INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemWrite, IARG_FAST_ANALYSIS_CALL, IARG_INST_PTR, IARG_MEMORYWRITE_EA,
          IARG_THREAD_ID, IARG_END);
    }
}
```

My problem is the severe runtime overhead (200x - 500x). According to other works, trace collection should introduce less than 100x overhead. I have tried to optimize it by skipping accesses to the stack, but that doesn't help much. Since I instrument at instruction granularity, a large number of accesses are logged. So I think the only way to reduce the runtime overhead is to reduce the number of accesses collected, i.e. to record only the accesses to variables shared between threads (the race-related ones).

Is there some way in Pin to figure out which accesses are to shared variables? Or are there any other ways to reduce the runtime overhead?

  • Did you try `INS_InsertCall` rather than the predicated one? Also, the [fast buffering API](https://software.intel.com/sites/landingpage/pintool/docs/97619/Pin/html/index.html#Buffering) is the way to go when your pin tool is very slow. It has a significantly lower impact on performance, at the expense of being harder to implement though... Make sure you also do a thorough review of [the performance considerations](https://software.intel.com/sites/landingpage/pintool/docs/97619/Pin/html/index.html#PERFORMANCE) as explained in the doc. – Neitsa Oct 08 '18 at 16:56
  • I tried `INS_InsertCall()` before, but it had the opposite effect: it calls the analysis routine for every instruction, including predicated ones whose predicate is false. – Pengfei Wang Oct 11 '18 at 14:27
  • Fast buffering is a good suggestion. I have just tried it as the example in Pin shows, but it still incurs significant overhead. The overhead comes mainly from dumping the traces to the log file. The dump first converts the buffered traces to strings, then writes them to the file one by one, which is very slow. So I changed it to write the whole buffer directly to the file without converting it to strings, which significantly improves the efficiency. The overhead is now only about 10-20x. The only remaining problem is that the log is harder to analyze because it is no longer written as text. – Pengfei Wang Oct 11 '18 at 14:43
  • Excellent :) This is what we do at work; we usually use the fast buffering API and produce a binary file of the trace, which is then "injected" into a database so we can query address ranges, reads, writes, etc. We also have a few converters that dump the binary traces to text files (nothing to do with PIN in this case, the goal is just to read a binary file and output a text file), keeping only the information that is needed for a quick review. I guess now you just have to write the "converter". – Neitsa Oct 12 '18 at 07:54
  • Yes, I have finished a converter in Python. I use Python to analyze the log file; the binary records can be converted to text with the `unpack()` function of the `struct` module. Many thanks! – Pengfei Wang Oct 12 '18 at 15:50
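
For anyone landing here later, a converter along the lines described in the comments might look roughly like the sketch below. The record layout is an assumption, not something stated in the thread: it supposes each record is a little-endian 16-byte struct of three 32-bit fields (`tid`, `ip`, `addr`) followed by a one-byte `op` and three padding bytes. The names `RECORD`, `iter_records` and `to_text` are made up for illustration. The format string has to be adjusted to match whatever struct the Pin tool actually writes.

```python
import struct

# Assumed (hypothetical) record layout; it must match the struct the Pin tool dumps:
# uint32 tid, uint32 ip, uint32 addr, char op, 3 padding bytes -> 16 bytes, little-endian.
RECORD = struct.Struct("<IIIc3x")


def iter_records(data):
    """Yield (tid, ip, addr, op) tuples from a bytes object of packed records."""
    # Ignore a truncated trailing record, e.g. if the traced program was killed mid-write.
    usable = len(data) - len(data) % RECORD.size
    for tid, ip, addr, op in RECORD.iter_unpack(data[:usable]):
        yield tid, ip, addr, op.decode("ascii")


def to_text(data):
    """Render each record as one line: thread id, instruction pointer, address, R/W."""
    return "\n".join(
        f"{tid} {ip:#x} {addr:#x} {op}" for tid, ip, addr, op in iter_records(data)
    )


if __name__ == "__main__":
    # Two synthetic records standing in for the contents of a binary trace file.
    sample = RECORD.pack(0, 0x8048000, 0x804A010, b"R") + RECORD.pack(
        1, 0x8048004, 0x804A010, b"W"
    )
    print(to_text(sample))
```

For a real trace, the bytes would come from something like `open(path, "rb").read()`. For very large logs it is probably better to read the file in chunks that are a multiple of `RECORD.size` than to load it all at once.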
