I'm running into an odd issue where reducing the size of the input to a basic operation inside a set of nested for loops causes an increase in runtime. Playing with it further, I've found it seems to be due to a notable increase in runtime when the input size is at, or near, a multiple of 128.
The function:
#include <cstddef>
#include <ctime>

float strangeIssue(size_t patchLength)
{
    using LabelType = unsigned char;
    LabelType* emptyPatch = new LabelType[patchLength * patchLength * patchLength];
    LabelType numClasses = 133;
    size_t arraySize = numClasses * patchLength * patchLength * patchLength;
    float* modelPred = new float[arraySize]; // deliberately left uninitialized, see below
    clock_t begin_time = clock();
    size_t predInd = 0;
    float currVal, maxVal;
    LabelType label;
    size_t sliceSize = patchLength * patchLength;
    size_t volumeSize = patchLength * patchLength * patchLength;
    for (size_t x = 0; x < patchLength; x++) {
        for (size_t y = 0; y < patchLength; y++) {
            for (size_t z = 0; z < patchLength; z++) {
                predInd = x * sliceSize + y * patchLength + z;
                maxVal = modelPred[predInd];
                label = 0;
                // argmax over classes: reads spaced volumeSize floats apart
                for (LabelType classInd = 1; classInd < numClasses; classInd++) {
                    size_t voxelClassInd = predInd + classInd * volumeSize;
                    currVal = modelPred[voxelClassInd];
                    if (currVal > maxVal) {
                        label = classInd;
                        maxVal = currVal;
                    }
                }
                emptyPatch[predInd] = label;
            }
        }
    }
    float totalTime = (float)(clock() - begin_time);
    delete[] modelPred;
    delete[] emptyPatch;
    return totalTime / CLOCKS_PER_SEC;
}
Calling it with:
std::vector<size_t> patchLengths({104, 112, 120, 127, 128, 129, 136, 144, 152, 160, 168, 176, 184, 240, 248, 255, 256, 257, 264, 272});
for (size_t patchLength : patchLengths) {
    std::cout << "patchLength " << patchLength << " time in secs " << strangeIssue(patchLength) << std::endl;
}
gives the following output (my comments added):
patchLength 104 time in secs 0.638
patchLength 112 time in secs 0.776
patchLength 120 time in secs 0.791
patchLength 127 time in secs 1.639 <--- not ~0.8xx?
patchLength 128 time in secs 2.175 <--- really?
patchLength 129 time in secs 1.596 <--- still pretty long
patchLength 136 time in secs 1.053 <--- getting back to expected
patchLength 144 time in secs 1.339
patchLength 152 time in secs 1.454
patchLength 160 time in secs 1.9
patchLength 168 time in secs 1.958
patchLength 176 time in secs 2.435
patchLength 184 time in secs 2.599
patchLength 240 time in secs 6.263
patchLength 248 time in secs 6.458
patchLength 255 time in secs 13.274 <--- why?
patchLength 256 time in secs 26.321 <--- wow!
patchLength 257 time in secs 13.764 <--- long
patchLength 264 time in secs 7.86 <--- ok
patchLength 272 time in secs 9.151
So the time increases roughly monotonically until around 128, yet 128 takes longer than 168, while 129 and even 136 take much less. Around 256, both 255 and 257 take longer than expected, though still about half as long as 256. Wall-clock time (i.e. std::chrono::high_resolution_clock) shows a similar pattern.
If I understand the standard right, the contents of modelPred are indeterminate, but when running this I've found the condition (currVal > maxVal) is never triggered. However, if I comment out the if statement, the compiler seems to be smart enough to skip the innermost for loop entirely and all times drop to ~0.
I'm seeing this with Release builds from VS 2017 and VS 2019 on Windows, and with gcc on Ubuntu, on a couple of different machines.