I'm optimizing a numerical hotspot that operates on a matrix.
Currently, I use blocking and loop unrolling to improve performance. However, I deliberately avoid peeling the borders. Instead, I let the blocked loops run past the edge of the matrix, so the algorithm ends up touching uninitialized values.
The matrix is generously over-allocated to absorb this overflow, so I never access memory outside my own allocation.
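A minimal sketch of the pattern described above, not the actual kernel: the row-scaling operation, the block width `BLOCK`, and the names `round_up_to_block` and `scale_rows_blocked` are all hypothetical and chosen only for illustration. It assumes a row-major matrix whose leading dimension is padded up to a multiple of the block width, so the unrolled loop needs no remainder loop.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 4 /* hypothetical unroll/block width */

/* Round n up to the next multiple of BLOCK. */
static size_t round_up_to_block(size_t n) {
    return (n + BLOCK - 1) / BLOCK * BLOCK;
}

/*
 * Multiply every element of a rows x cols row-major matrix by s.
 * ld is the padded leading dimension (allocated row length) and must be
 * at least round_up_to_block(cols).
 *
 * The inner loop is unrolled by BLOCK and has no peeled remainder loop:
 * the last block of each row may read and write the padding columns
 * [cols, ld). Those columns are never initialized by the caller in the
 * scenario described above, but they lie inside the allocation.
 */
static void scale_rows_blocked(double *a, size_t rows, size_t cols,
                               size_t ld, double s) {
    assert(ld >= round_up_to_block(cols));
    for (size_t i = 0; i < rows; ++i) {
        double *row = a + i * ld;
        for (size_t j = 0; j < cols; j += BLOCK) {
            row[j + 0] *= s;
            row[j + 1] *= s;
            row[j + 2] *= s;
            row[j + 3] *= s; /* may land in the padding on the last block */
        }
    }
}
```

With `cols = 6` and `BLOCK = 4`, the leading dimension becomes 8, and the second block of each row touches columns 6 and 7, which hold whatever the allocator left there. Only columns 0 to 5 carry meaningful results.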
I don't do peeling for several reasons:
- Laziness
- The performance hit from the very poor locality of the peeled border case.
- To avoid complex border peeling code.
What I am wondering is whether these overflowing accesses, which touch uninitialized values, could themselves cause a performance hit.
I know exactly where the uninitialized accesses happen, and Valgrind reports them as well. I have also profiled the code with Intel's VTune and could not see any sign of degraded performance caused by them.