3

Answer: user and kernel cpu time is measured separately, be careful!

EDIT: pTimes is now reset to zero between each benchmark, but the results got weirder!

With the end goal of fiddling with my own custom memory management schemes, I made a simple benchmark for the existing malloc() in Visual Community 2019, Windows 10. Out of interest, I benchmarked the CPU times as well as wall times, and I tested malloc by allocating a large block of memory in many chunks, and then individually freeing each chunk without using them. See here:

void malloc_distr256(int nMemsize) {
    long long pFreeList[256];

    for (int i = 0; i < 256; ++i) pFreeList[i] = malloc(nMemsize >> 8);
    for (int i = 0; i < 256; ++i) free((void*)pFreeList[i]);
}

void malloc_distr64(int nMemsize) {
    long long pFreeList[64];

    for (int i = 0; i < 64; ++i) pFreeList[i] = malloc(nMemsize >> 6);
    for (int i = 0; i < 64; ++i) free((void*)pFreeList[i]);
}

void malloc_distr0(int nMemsize) {
    void* pMem = malloc(nMemsize);

    free(pMem);
}

These functions I benchmarked using the following code - "BenchTimes" is just a struct holding double CPU/wall times:

inline double cputime() {
    FILETIME lpCreationTime;
    FILETIME lpExitTime;
    FILETIME lpKernelTime;
    FILETIME lpUserTime;

    if (GetProcessTimes(GetCurrentProcess(), &lpCreationTime, &lpExitTime, &lpKernelTime, &lpUserTime)) {
        double dUnits = (double)(lpUserTime.dwLowDateTime | (long long)lpUserTime.dwHighDateTime << 32);
        return dUnits * 0.1;
    }
    else return 0xFFF0000000000000;
}

inline double walltime() {
    LARGE_INTEGER lnFreq, lnTime;

    if (QueryPerformanceFrequency(&lnFreq)) if (QueryPerformanceCounter(&lnTime))
        return 1000000.0 * (double)lnTime.QuadPart / (double)lnFreq.QuadPart;
//multiply by 1,000,000 to convert seconds to microseconds
//because the cpu time measurer I had in microseconds as well
    return 0.0;
}

void bench(void (pfnFunc)(int), int nMemsize, int nIters, int nReps, BenchTimes* pTimes) {
    pTimes->dCpuTime = 0.0;
    pTimes->dWallTime = 0.0;

    for (volatile int r = 0; r < nReps; ++r) {
        double dCpuStart = cputime();
        double dWallStart = walltime();

        for (volatile int i = 0; i < nIters; ++i) pfnFunc(nMemsize);

        double dCpuEnd = cputime();
        double dWallEnd = walltime();

        double dCpuDiff = dCpuEnd - dCpuStart;
        double dWallDiff = dWallEnd - dWallStart;

        pTimes->dCpuTime += dCpuDiff;
        pTimes->dWallTime += dWallDiff;
    }
}

These are the timings as measured on my pc (i5-9400f), timed in seconds. enter image description here

I am very curious about the large discrepancies in performance and in Wall time to CPU time comparisons!

The code to run this is here:

    BenchTimes sTimes;

    bench(malloc_distr256, 1 << 20, 100, 1000, &sTimes);

    fprintf(stdout, "Malloc alloc/free bench allocated %lf megabytes, distributed over 256 chunks\n", (double)(1 << 20) / 1000000);
    fprintf(stdout, "Malloc alloc/free bench returned:\nWalltime - total: %lf\nCPU Time - total: %lf\n", sTimes.dWallTime / 1000000, sTimes.dCpuTime / 1000000);
    
    bench(malloc_distr64, 1 << 20, 100, 1000, &sTimes);

    fprintf(stdout, "\nMalloc alloc/free bench allocated %lf megabytes, distributed over 64 chunks\n", (double)(1 << 20) / 1000000);
    fprintf(stdout, "\nMalloc alloc/free bench returned:\nWalltime - total: %lf\nCPU Time - total: %lf\n", sTimes.dWallTime / 1000000, sTimes.dCpuTime / 1000000);

    bench(malloc_distr0, 1 << 20, 100, 1000, &sTimes);

    fprintf(stdout, "\nMalloc alloc/free bench allocated %lf megabytes, distributed over no chunks\n", (double)(1 << 20) / 1000000);
    fprintf(stdout, "\nMalloc alloc/free bench returned:\nWalltime - total: %lf\nCPU Time - total: %lf\n", sTimes.dWallTime / 1000000, sTimes.dCpuTime / 1000000);

    system("pause");
FShrike
  • 323
  • 1
  • 10

1 Answers1

5

malloc is implemented via HeapAlloc which is implemented in system function named RtlAllocateHeap.

The function manages heap. It allocates system memory pages via VirtualAlloc[Ex] or its equivalent, and provides smaller allocations within these pages.

With larger allocations, the VirtualAlloc[Ex] equivalent is called on every allocation, with smaller allocations it is called occasionally.

VirtualAlloc[Ex] is implemented using NtAllocateVirtualMemory which is kernel call. Most of the time spent in it is not counted in lpUserTime.

On the other hand, QueryPerformanceCounter is a honest total time.

Alex Guteniev
  • 12,039
  • 2
  • 34
  • 79
  • So the process would spend as much as two seconds in the kernel when executing a 1MB malloc with no chunking? – FShrike Feb 16 '21 at 18:53
  • But it only spends 0.2 seconds in the kernel for benches with 256 and 64 allocations? I don't understand the variance – FShrike Feb 16 '21 at 18:55
  • @IdioticShrike, you can check on your own actually. `GetProcessTimes` returns kernel time as well. Add another function, or make your `cputime()` returning two values – Alex Guteniev Feb 16 '21 at 18:55
  • that made the wall times and cpu times within 0.1 of a second of each other every time - thank you for pointing this out – FShrike Feb 16 '21 at 18:59
  • Larger allocation (significantly larger than system memory page, which is 65536 bytes) may indeed take more kernel time than small, as the system has to find contiguous piece in process's virtual address space – Alex Guteniev Feb 16 '21 at 19:03