Odd discrepancy in CPU vs Wall time measurements with malloc

Question

Answer: user and kernel cpu time is measured separately, be careful!

EDIT: pTimes is now reset to zero between each benchmark, but the results got weirder!

With the end goal of fiddling with my own custom memory management schemes, I made a simple benchmark for the existing malloc() in Visual Community 2019, Windows 10. Out of interest, I benchmarked the CPU times as well as wall times, and I tested malloc by allocating a large block of memory in many chunks, and then individually freeing each chunk without using them. See here:

void malloc_distr256(int nMemsize) {
    long long pFreeList[256];

    for (int i = 0; i < 256; ++i) pFreeList[i] = malloc(nMemsize >> 8);
    for (int i = 0; i < 256; ++i) free((void*)pFreeList[i]);
}

void malloc_distr64(int nMemsize) {
    long long pFreeList[64];

    for (int i = 0; i < 64; ++i) pFreeList[i] = malloc(nMemsize >> 6);
    for (int i = 0; i < 64; ++i) free((void*)pFreeList[i]);
}

void malloc_distr0(int nMemsize) {
    void* pMem = malloc(nMemsize);

    free(pMem);
}

These functions I benchmarked using the following code - "BenchTimes" is just a struct holding double CPU/wall times:

inline double cputime() {
    FILETIME lpCreationTime;
    FILETIME lpExitTime;
    FILETIME lpKernelTime;
    FILETIME lpUserTime;

    if (GetProcessTimes(GetCurrentProcess(), &lpCreationTime, &lpExitTime, &lpKernelTime, &lpUserTime)) {
        double dUnits = (double)(lpUserTime.dwLowDateTime | (long long)lpUserTime.dwHighDateTime << 32);
        return dUnits * 0.1;
    }
    else return 0xFFF0000000000000;
}

inline double walltime() {
    LARGE_INTEGER lnFreq, lnTime;

    if (QueryPerformanceFrequency(&lnFreq)) if (QueryPerformanceCounter(&lnTime))
        return 1000000.0 * (double)lnTime.QuadPart / (double)lnFreq.QuadPart;
//multiply by 1,000,000 to convert seconds to microseconds
//because the cpu time measurer I had in microseconds as well
    return 0.0;
}

void bench(void (pfnFunc)(int), int nMemsize, int nIters, int nReps, BenchTimes* pTimes) {
    pTimes->dCpuTime = 0.0;
    pTimes->dWallTime = 0.0;

    for (volatile int r = 0; r < nReps; ++r) {
        double dCpuStart = cputime();
        double dWallStart = walltime();

        for (volatile int i = 0; i < nIters; ++i) pfnFunc(nMemsize);

        double dCpuEnd = cputime();
        double dWallEnd = walltime();

        double dCpuDiff = dCpuEnd - dCpuStart;
        double dWallDiff = dWallEnd - dWallStart;

        pTimes->dCpuTime += dCpuDiff;
        pTimes->dWallTime += dWallDiff;
    }
}

These are the timings as measured on my pc (i5-9400f), timed in seconds.

I am very curious about the large discrepancies in performance and in Wall time to CPU time comparisons!

The code to run this is here:

    BenchTimes sTimes;

    bench(malloc_distr256, 1 << 20, 100, 1000, &sTimes);

    fprintf(stdout, "Malloc alloc/free bench allocated %lf megabytes, distributed over 256 chunks\n", (double)(1 << 20) / 1000000);
    fprintf(stdout, "Malloc alloc/free bench returned:\nWalltime - total: %lf\nCPU Time - total: %lf\n", sTimes.dWallTime / 1000000, sTimes.dCpuTime / 1000000);
    
    bench(malloc_distr64, 1 << 20, 100, 1000, &sTimes);

    fprintf(stdout, "\nMalloc alloc/free bench allocated %lf megabytes, distributed over 64 chunks\n", (double)(1 << 20) / 1000000);
    fprintf(stdout, "\nMalloc alloc/free bench returned:\nWalltime - total: %lf\nCPU Time - total: %lf\n", sTimes.dWallTime / 1000000, sTimes.dCpuTime / 1000000);

    bench(malloc_distr0, 1 << 20, 100, 1000, &sTimes);

    fprintf(stdout, "\nMalloc alloc/free bench allocated %lf megabytes, distributed over no chunks\n", (double)(1 << 20) / 1000000);
    fprintf(stdout, "\nMalloc alloc/free bench returned:\nWalltime - total: %lf\nCPU Time - total: %lf\n", sTimes.dWallTime / 1000000, sTimes.dCpuTime / 1000000);

    system("pause");

Why `long long pFreeList[256];`? Why not `void *pFreeList[256];`? — Some programmer dude, Feb 16 '21 at 18:22
@Someprogrammerdude Point taken- but that is besides the point of the question — FShrike, Feb 16 '21 at 18:23
Since you don't show a [mcve] (how, when and where do you call `bench`?) all we can do is guess, but my guess is that you don't reset the members of `pTimes` between calls. — Some programmer dude, Feb 16 '21 at 18:24
@Someprogrammerdude your guess is correct and I am a fool - thank you — FShrike, Feb 16 '21 at 18:42
@Someprogrammerdude I made the necessary changes and the results remained inconsistent! I'm about to post the minimal reproducible example — FShrike, Feb 16 '21 at 18:49
I'm going to guess "paging"? The OS has to page out memory in order to free enough to malloc, during which time the thread is idle? — Mooing Duck, Feb 16 '21 at 18:50
It is initialised in the function bench(), and the first call is the one with the least interesting results anyhow @Someprogrammerdude — FShrike, Feb 16 '21 at 18:54

score 5 · Accepted Answer · answered Feb 16 '21 at 18:50

5

malloc is implemented via HeapAlloc which is implemented in system function named RtlAllocateHeap.

The function manages heap. It allocates system memory pages via VirtualAlloc[Ex] or its equivalent, and provides smaller allocations within these pages.

With larger allocations, the VirtualAlloc[Ex] equivalent is called on every allocation, with smaller allocations it is called occasionally.

VirtualAlloc[Ex] is implemented using NtAllocateVirtualMemory which is kernel call. Most of the time spent in it is not counted in lpUserTime.

On the other hand, QueryPerformanceCounter is a honest total time.

answered Feb 16 '21 at 18:50

Alex Guteniev

12,039
2
34
79

So the process would spend as much as two seconds in the kernel when executing a 1MB malloc with no chunking? – FShrike Feb 16 '21 at 18:53
But it only spends 0.2 seconds in the kernel for benches with 256 and 64 allocations? I don't understand the variance – FShrike Feb 16 '21 at 18:55
@IdioticShrike, you can check on your own actually. `GetProcessTimes` returns kernel time as well. Add another function, or make your `cputime()` returning two values – Alex Guteniev Feb 16 '21 at 18:55
that made the wall times and cpu times within 0.1 of a second of each other every time - thank you for pointing this out – FShrike Feb 16 '21 at 18:59
Larger allocation (significantly larger than system memory page, which is 65536 bytes) may indeed take more kernel time than small, as the system has to find contiguous piece in process's virtual address space – Alex Guteniev Feb 16 '21 at 19:03

Odd discrepancy in CPU vs Wall time measurements with malloc

1 Answers1