If it's really time-sensitive, then don't format strings inside the critical loop.
I'd go for appending binary records to a log buffer inside the critical loop. The outer loop can either write that buffer directly to a binary file (to be processed later), or format text based on the records.
This has the advantage that the loop only needs to track a couple of extra variables (pointers to the end of the used and allocated space of one std::vector), rather than two pointers for a std::vector for every variable being logged. This will have much lower impact on register allocation in the critical loop.
In my testing, it looks like you just get a bit of extra loop overhead to track the vector, and a store instruction for every variable you want to log. I didn't write a big enough test loop to expose any potential problems from keeping all the variables "alive" until the emplace_back(). If the compiler does a bad job with bigger loops where it needs to spill registers, see the section below about using a simple array without any size checking. That should remove any constraint on the compiler that makes it try to do all the stores into the log buffer at the same time.
Here's an example of what I'm suggesting. It compiles and runs, writing a binary log file which you can hexdump.
See the source and asm output with nice formatting on the Godbolt compiler explorer. It can even colourize source and asm lines so you can more easily see which asm comes from which source line.
#include <vector>
#include <cstdint>
#include <cstddef>
#include <iostream>
struct loop_log {
    // Generally sort members in order of size for better packing.
    // Use as narrow types as possible to reduce memory bandwidth.
    // e.g. logging an int loop counter into a short log record is fine if you're sure
    // it always fits in a short in practice, and it has zero performance downside.
    int64_t  x, y, z;
    uint64_t ux, uy, uz;
    int32_t  a, b, c;
    uint16_t t, i, j;
    uint8_t  c1, c2, c3;

    // isn't there a less-repetitive way to write this?
    loop_log(int64_t x, int32_t a, int outer_counter, char c1)
        : x(x), a(a), i(outer_counter), c1(c1)
        // leaves the other members *uninitialized*, not zeroed.
        // note the lack of a gcc warning for initializing uint16_t i from an int,
        // or for not mentioning every member
    {}
};
static constexpr size_t initial_reserve = 10000;

// take some args so gcc can't count the iterations at compile time
void foo(std::ostream &logfile, int outer_iterations, int inner_param) {
    std::vector<struct loop_log> log;
    log.reserve(initial_reserve);

    int outer_counter = outer_iterations;
    while (--outer_counter) {
        // non timing sensitive code
        int32_t a = inner_param - outer_counter;
        while (a != 0) {
            // timing sensitive code
            a <<= 1;
            int64_t x = outer_counter * (100LL + a);
            char c1 = x;
            // much more efficient code with gcc 5.3 -O3 than push_back( a struct literal );
            log.emplace_back(x, a, outer_counter, c1);
        }
        const auto logdata = log.data();
        const size_t bytes = log.size() * sizeof(*logdata);
        // write the group size, then a group of records
        logfile.write( reinterpret_cast<const char *>(&bytes), sizeof(bytes) );
        logfile.write( reinterpret_cast<const char *>(logdata), bytes );
        // you could format the records into strings at this point if you want
        log.clear();
    }
}
#include <fstream>

int main() {
    // open in binary mode so the raw records don't get newline-translated on Windows
    std::ofstream logfile("dbg.log", std::ios::binary);
    foo(logfile, 100, 10);
}
gcc's output for foo() pretty much optimizes away all the vector overhead. As long as the initial reserve() is big enough, the inner loop is just:
## gcc 5.3 -masm=intel -O3 -march=haswell -std=gnu++11 -fverbose-asm
## The inner loop from the above C++:
.L59:
test rbx, rbx # log // IDK why gcc wants to check for a NULL pointer inside the hot loop, instead of doing it once after reserve() calls new()
je .L6 #,
mov QWORD PTR [rbx], rbp # log_53->x, x // emplace_back the 4 elements
mov DWORD PTR [rbx+48], r12d # log_53->a, a
mov WORD PTR [rbx+62], r15w # log_53->i, outer_counter
mov BYTE PTR [rbx+66], bpl # log_53->c1, x
.L6:
add rbx, 72 # log, // struct size is 72B
mov r8, r13 # D.44727, log
test r12d, r12d # a
je .L58 #, // a != 0
.L4:
add r12d, r12d # a // a <<= 1
movsx rbp, r12d # D.44726, a // x = ...
add rbp, 100 # D.44726, // x = ...
imul rbp, QWORD PTR [rsp+8] # x, %sfp // x = ...
cmp r14, rbx # log$D40277$_M_impl$_M_end_of_storage, log
jne .L59 #, // stay in this tight loop as long as we don't run out of reserved space in the vector
// fall through into code that allocates more space and copies.
// gcc generates pretty lame copy code, using 8B integer loads/stores, not rep movsq. Clang uses AVX to copy 32B at a time
// anyway, that code never runs as long as the reserve is big enough
// I guess std::vector doesn't try to realloc() to avoid the copy if possible (e.g. if the following virtual address region is unused) :/
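As mentioned at the top, you can also skip text formatting entirely at runtime and post-process the binary log in a separate program. Here's a rough, untested sketch of a reader for the "group size, then group of records" format (dump_log is just a name I picked; it assumes the loop_log definition above is visible, and that you run it on a machine with the same struct layout and endianness as the writer):

#include <cstddef>
#include <fstream>
#include <iostream>
#include <vector>

void dump_log(std::istream &in) {
    size_t bytes;
    while (in.read(reinterpret_cast<char *>(&bytes), sizeof(bytes))) {   // group header
        std::vector<char> buf(bytes);
        if (!in.read(buf.data(), bytes))
            break;                                   // truncated file
        // (a stricter tool might memcpy each record into a loop_log instead of aliasing)
        const loop_log *records = reinterpret_cast<const loop_log *>(buf.data());
        const size_t n = bytes / sizeof(loop_log);
        for (size_t k = 0; k < n; ++k)               // format here, outside the hot loop
            std::cout << "x=" << records[k].x << " a=" << records[k].a
                      << " i=" << records[k].i << " c1=" << int(records[k].c1) << '\n';
    }
}

int main() {
    std::ifstream logfile("dbg.log", std::ios::binary);
    dump_log(logfile);
}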
An attempt to avoid repetitive constructor code:
I tried a version that uses a braced initializer list to avoid having to write a really repetitive constructor, but got much worse code from gcc:
#ifdef USE_CONSTRUCTOR
    // much more efficient code with gcc 5.3 -O3.
    log.emplace_back(x, a, outer_counter, c1);
#else
    // Put the mapping from local var names to struct member names right here in with the loop
    log.push_back( (struct loop_log) {
        .x = x, .y = 0, .z = 0,     // C99 designated initializers are a GNU extension to C++,
        .ux = 0, .uy = 0, .uz = 0,  // but gcc doesn't support leaving elements uninitialized before the last initialized one:
        .a = a, .b = 0, .c = 0,     // without all the ...=0, you get "sorry, unimplemented: non-trivial designated initializers not supported"
        .t = 0, .i = outer_counter, .j = 0,
        .c1 = (uint8_t)c1
    } );
#endif
This unfortunately stores a struct onto the stack and then copies it 8B at a time with code like:
mov rax, QWORD PTR [rsp+72]
mov QWORD PTR [rdx+8], rax // rdx points into the vector's buffer
mov rax, QWORD PTR [rsp+80]
mov QWORD PTR [rdx+16], rax
... // total of 9 loads/stores for a 72B struct
So it will have more impact on the inner loop.
There are a few ways to push_back() a struct into a vector, but using a braced-initializer-list unfortunately seems to always result in a copy that gcc 5.3 doesn't optimize away. It would be nice to avoid writing a lot of repetitive constructor code. And with designated-initializer lists ({.x = val}), the code inside the loop wouldn't have to care what order the struct actually stores things in; you could just write the members in an easy-to-read order.

BTW, the .x = val C99 designated-initializer syntax is a GNU extension to C++. Also, you can get warnings for forgetting to initialize a member in a braced list with gcc's -Wextra (which enables -Wmissing-field-initializers).
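For example (a separate toy program, not part of the logging code): compiled with g++ -Wextra, the partial braced list below draws something like "warning: missing initializer for member 'point::b'", while b still gets value-initialized to zero.

struct point { int a, b; };

int main() {
    point p = {1};    // -Wmissing-field-initializers (enabled by -Wextra) warns about b here
    return p.a + p.b; // b was value-initialized to 0, so this returns 1
}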
For more on syntax for initializers, have a look at Brace-enclosed initializer list constructor and the docs for member initialization.
This was a fun but terrible idea:
// Doesn't compile. Worse: hard to read, probably easy to screw up
while (outerloop) {
    int64_t x = 0, y = 1;
    struct loop_log { int64_t logx = x, logy = y; };  // loop vars as default initializers
    // error: default initializers can't be local vars with automatic storage.
    while (innerloop) { x += y; y += x; log.emplace_back(loop_log()); }
}
Lower overhead from using a flat array instead of a std::vector
Perhaps trying to get the compiler to optimize away every bit of std::vector bookkeeping is not as good as just making a big array of structs (static, local, or dynamic) and keeping count yourself of how many records are valid. std::vector checks whether you've used up the reserved space on every iteration, but you don't need anything like that if there is a fixed upper bound you can use to allocate enough space to never overflow. (Depending on the platform and how you allocate the space, a big chunk of memory that's allocated but never written isn't really a problem. e.g. on Linux, malloc uses mmap(MAP_ANONYMOUS) for big allocations, which gives you pages that are all copy-on-write mapped to a zeroed physical page. The OS doesn't need to allocate physical pages until you write them. The same should apply to a large static array.)
So in your loop, you could just have code like
loop_log *current_record = logbuf;
while (inner_loop) {
    int64_t x = ...;
    current_record->x = x;
    ...
    current_record->i = (short)outer_counter;
    ...
    // or maybe
    // *current_record = { .x = x, .i = (short)outer_counter };
    // compilers will probably have an easier time avoiding any copying with a
    // braced initializer list in this case than with vector.push_back
    current_record++;
}
size_t record_bytes = (current_record - logbuf) * sizeof(logbuf[0]);
// or: size_t record_bytes = reinterpret_cast<char*>(current_record) - reinterpret_cast<char*>(logbuf);
logfile.write((const char*)logbuf, record_bytes);
Scattering the stores throughout the inner loop means the array pointer has to stay live the whole time, but OTOH it doesn't require all the logged variables to be live at the same time. IDK whether gcc would optimize an emplace_back into storing each variable into the vector as soon as that variable was no longer needed, or whether it might spill the variables to the stack and then copy them all into the vector in one group of instructions.

Using logbuf[records++].x = ... might lead to the compiler keeping the array pointer and the counter around, tying up two registers, since we'd use the record count in the outer loop. We want the inner loop to be fast, and can afford to do the subtraction in the outer loop, so I wrote it with pointer increments to encourage the compiler to use only one register for that piece of state. Besides register pressure, base+index store instructions are less efficient on Intel SnB-family hardware than single-register addressing modes.
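Putting those pieces together, a complete flat-array version of foo() could look something like this (my sketch: max_records, the static buffer, and reusing the toy loop body from above are arbitrary choices, and it assumes loop_log is made default-constructible, e.g. by dropping the constructor or adding loop_log() = default;, so a plain array of it is legal):

static constexpr size_t max_records = 1 << 20;   // a safe upper bound you'd pick for your loop
static loop_log logbuf[max_records];             // big static array: untouched pages cost ~nothing

void foo_array(std::ostream &logfile, int outer_iterations, int inner_param) {
    int outer_counter = outer_iterations;
    while (--outer_counter) {
        loop_log *current_record = logbuf;       // reset for this group of records
        int32_t a = inner_param - outer_counter;
        while (a != 0) {
            // timing sensitive code: plain stores, no capacity check
            a <<= 1;
            int64_t x = outer_counter * (100LL + a);
            current_record->x  = x;
            current_record->a  = a;
            current_record->i  = (uint16_t)outer_counter;
            current_record->c1 = (uint8_t)x;
            ++current_record;                    // one pointer is the only extra loop-carried state
        }
        // non timing sensitive: size calculation and I/O happen out here
        size_t record_bytes = (current_record - logbuf) * sizeof(logbuf[0]);
        logfile.write(reinterpret_cast<const char *>(&record_bytes), sizeof(record_bytes));
        logfile.write(reinterpret_cast<const char *>(logbuf), record_bytes);
    }
}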
You could still use a std::vector for this, but it's hard to get std::vector not to write zeroes into memory it allocates. reserve() just allocates without zeroing, but calling .data() and using the reserved space without telling the vector about it with .resize() kind of defeats the purpose. And of course .resize() will initialize all the new elements. So std::vector is a bad choice for getting your hands on a large allocation without dirtying it.
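If you'd rather heap-allocate with RAII than use a big static array, plain new[] without the value-initializing () gets you an untouched buffer. Another sketch, with the same default-constructibility assumption as above; make_logbuf is just a name I made up:

#include <memory>

// "new loop_log[n]" with no parentheses default-initializes: for a trivially
// default-constructible loop_log the elements are left untouched, so pages only
// get dirtied as records are written (like the static array / mmap cases above).
// Note that std::make_unique<loop_log[]>(n) would value-initialize (zero) the
// whole array, which defeats the purpose.
std::unique_ptr<loop_log[]> make_logbuf(size_t max_records) {
    return std::unique_ptr<loop_log[]>(new loop_log[max_records]);
}

// usage inside foo():
//   auto buf = make_logbuf(max_records);
//   loop_log *current_record = buf.get();
//   ... scatter stores and bump current_record as in the flat-array loop above ...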