
I'm migrating my application from Windows 7 to Windows 10.
All functions worked without any changes, but execution was slower than on Windows 7.
It seemed that object construction/destruction was slow, so I created a simple benchmark program for malloc() and free(), shown below.

for (int i = 0; i < 100; i++)
{
  QueryPerformanceCounter(&gStart);
  p = malloc(size);
  free(p);
  QueryPerformanceCounter(&gEnd);
  printf("%d, %g\n", i, gEnd.QuadPart-gStart.QuadPart);
  if (p == NULL)
    printf("ERROR\n", size);
}

I ran this program on both Windows 7 and Windows 10 on the same PC. I measured malloc() and free() performance with data sizes of 1, 100, 1000, 10000, 100000, 1000000, 10000000 and 100000000 bytes.
In all of these cases, Windows 10 was slower than Windows 7.
In particular, Windows 10 was more than ten times slower than Windows 7 when the data size is 10000000 or 100000000 bytes.

When the data size is 10000000 bytes:

  • Windows 7 : 0.391392 msec
  • Windows 10 : 4.254411 msec

When the data size is 100000000 bytes:

  • Windows 7 : 0.602178 msec
  • Windows 10 : 38.713946 msec

Do you have any suggestions for improving this on Windows 10?

I've experimented with the following on Windows 10, but unfortunately performance did not improve.

  • Disabled Superfetch
  • Disabled Ndu.sys
  • Ran Disk Cleanup

Here is the source code. (updated Feb 15th)

#include "stdafx.h"

#define START_TIME  QueryPerformanceCounter(&gStart);
#define END_TIME    QueryPerformanceCounter(&gEnd);

#define PRT_FMT(fmt, ...)   printf(fmt, __VA_ARGS__);
#define PRT_TITLE(fmt, ...) printf(fmt, __VA_ARGS__); gTotal.QuadPart = 0;
// QuadPart is a LONGLONG, so print it with %lld
#define PRT_RESULT  printf(",%lld", gEnd.QuadPart-gStart.QuadPart); gTotal.QuadPart+=(gEnd.QuadPart-gStart.QuadPart);
#define PRT_END printf("\n");
//#define PRT_END       printf(",total,%lld,%lld\n", gTotal.QuadPart, gTotal.QuadPart*1000000/gFreq.QuadPart);


LARGE_INTEGER gStart;
LARGE_INTEGER gEnd;
LARGE_INTEGER gTotal;
LARGE_INTEGER gFreq;

void
t_Empty()
{
    PRT_TITLE("02_Empty");
    START_TIME
    END_TIME; PRT_RESULT
    PRT_END
}
void
t_Sleep1234()
{
    PRT_TITLE("01_Sleep1234");
    START_TIME
        Sleep(1234);
    END_TIME; PRT_RESULT
    PRT_END
}

void*
t_Malloc_Free(size_t size)
{
    void* pVoid;

    PRT_TITLE("Malloc_Free_%d", size);
    for(int i=0; i<100; i++)
    {
        START_TIME
        pVoid = malloc(size);
        free(pVoid);
        END_TIME; PRT_RESULT
        if(pVoid == NULL)
        {
            PRT_FMT("ERROR size(%d)", size);
        }

    }
    PRT_END

    return pVoid;
}

int _tmain(int argc, _TCHAR* argv[])
{
    int i;
    QueryPerformanceFrequency(&gFreq);
    PRT_FMT("00_QueryPerformanceFrequency, %lld\n", gFreq.QuadPart);

    t_Empty();
    t_Sleep1234();

    for(i=0; i<10; i++)
    {
        t_Malloc_Free(1);
        t_Malloc_Free(100);
        t_Malloc_Free(1000);        //1KB
        t_Malloc_Free(10000);
        t_Malloc_Free(100000);
        t_Malloc_Free(1000000);     //1MB
        t_Malloc_Free(10000000);    //10MB
        t_Malloc_Free(100000000);   //100MB
    }
    return 0;
}

Result in my environment (built with VS2010 on Windows 7), in the 100 MB case:

  • QPC count in Windows 7 : 11.52 (4.03 usec)

  • QPC count in Windows 10 : 973.28 (341 usec)

pleiades92
  • Assessing performance of `malloc()/free()` without _using_ that allocated memory can give false results. http://stackoverflow.com/q/19991623/2410359 Suggest using the allocated memory to see if the performance difference is real. – chux - Reinstate Monica Oct 06 '16 at 15:03
  • @chux good find, seems related to this one anyway – stijn Oct 06 '16 at 15:15
  • Thanks for the helpful suggestion. I'll update my benchmark program to access the allocated memory and measure again. – pleiades92 Oct 07 '16 at 06:13
  • Are you running this in a debugger? If so, it may be the slow debug heap. – jcoder Nov 15 '16 at 08:09
  • Thanks for the helpful suggestion. No, I built this program with the release configuration and didn't use a debugger. – pleiades92 Nov 18 '16 at 07:43
  • This is impossible; they may have some difference, but not at this magnitude. – Stargateur Feb 07 '17 at 10:41
  • Which compiler are you using? Is it the same on both Windows 7 and Windows 10? – BillyJoe Feb 07 '17 at 14:25
  • Can you provide the source code reproducing your test? – Simon Mourier Feb 07 '17 at 15:01
  • How, exactly, are you running both Windows 7 and Windows 10 on the same machine? Is it some sort of dual boot, a virtual machine, or some other mechanism? – Peter Camilleri Feb 07 '17 at 15:22
  • I am not the original poster; I've added the bounty because we have run into exactly the same problem on one of our Windows 10 machines (pre-Anniversary Update) running production software. When examining the application using Intel VTune Amplifier, the machine in question shows a significant increase in time spent in new() and free() operations. A developer machine running Windows 10 with the Anniversary Update does not show the same behavior. – Jens Habegger Feb 07 '17 at 15:27
  • Answering previous comments: mbjoe: It's the very same compiler (VS2010 C++). Simon Mourier: Unfortunately not. Peter Camilleri: They are different machines altogether; switching OSes on a production machine is considered a last resort right now. – Jens Habegger Feb 07 '17 at 15:30
  • I read that you should add `#define _NO_DEBUG_HEAP=1` even when compiling the release version. Also, I would try to understand if you are using the same VS2010 DLLs on both environments. – BillyJoe Feb 08 '17 at 07:54
  • @mbjoe: Thank you, we'll check if using the define makes any difference. The respective DLL versions have been checked, their versions do match. – Jens Habegger Feb 08 '17 at 07:57
  • @JensHabegger - Windows 10 has a new "segment heap" used in certain cases (mostly store apps): https://www.blackhat.com/docs/us-16/materials/us-16-Yason-Windows-10-Segment-Heap-Internals.pdf maybe that's related to perf problems you see. Difficult to help w/o any reproducing code – Simon Mourier Feb 08 '17 at 14:20
  • @SimonMourier Thanks, segment heap settings have been checked, they were not set (not per-executable, not globally). We also tried specifically disabling the setting (again, both per-executable and globally) without effect. – Jens Habegger Feb 08 '17 at 14:25
  • @mbjoe setting `#define _NO_DEBUG_HEAP=1` has no effect. – Jens Habegger Feb 09 '17 at 07:48
  • Hello, thanks for your comments. At first, I compiled using the VS2010 compiler and the 'release' build setting. Windows 7 and Windows 10 run on the same machine via dual boot. I'll post the source code later; I need to find it first. – pleiades92 Feb 10 '17 at 05:54
  • You guys are chasing ghosts, I'm afraid. The explanation for the strange behavior in the original code is merely incorrect use of the `printf` function. – Lundin Feb 10 '17 at 15:30
  • Are you using both Windows versions in 32 bits, in 64 bits, or are they different? – hm1912 Feb 11 '17 at 11:16
  • @Xeneda I'm using 64bit only. – pleiades92 Feb 15 '17 at 04:50
  • Hello, thanks for your comments. I've updated the source code; please see above. – pleiades92 Feb 15 '17 at 09:58
  • I filed another memory management issue (multithreaded access scenarios) in Windows 10 here: https://stackoverflow.com/questions/45024029/windows-10-poor-memory-access-scalability-compared-to-windows-7 – nikoniko Jul 11 '17 at 01:56

2 Answers


One thing that may have some impact is that the internals of the QueryPerformanceCounter API have apparently changed from Windows 7 to Windows 8. https://msdn.microsoft.com/en-us/library/windows/desktop/dn553408(v=vs.85).aspx

Windows 8, Windows 8.1, Windows Server 2012, and Windows Server 2012 R2 use TSCs as the basis for the performance counter. The TSC synchronization algorithm was significantly improved to better accommodate large systems with many processors.
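
One practical consequence is that the counter frequency reported by QueryPerformanceFrequency is not guaranteed to be the same across the two installations, so raw QPC counts are not directly comparable between Windows 7 and Windows 10; convert them to time units first. A minimal sketch (assuming <windows.h> and <stdio.h> are included):

LARGE_INTEGER freq, start, end;
QueryPerformanceFrequency(&freq);

QueryPerformanceCounter(&start);
/* ... code under test ... */
QueryPerformanceCounter(&end);

/* Multiply before dividing so sub-tick precision is not lost. */
long long usec = (end.QuadPart - start.QuadPart) * 1000000LL / freq.QuadPart;
printf("elapsed: %lld usec\n", usec);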


More importantly, your benchmarking code is in itself broken. QuadPart is of type LONGLONG, as is the expression gEnd.QuadPart-gStart.QuadPart. But you print this expression with the %g format specifier, which expects a double. So you invoke undefined behavior, and the output you have been reading is complete nonsense.

Similarly, printf("ERROR\n", size); is another bug: the size argument has no matching conversion specifier in the format string, so it is silently ignored.
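
For reference, the loop from the question could be rewritten without either bug (a sketch; p, size, gStart and gEnd are assumed to be declared as in the original):

for (int i = 0; i < 100; i++)
{
    QueryPerformanceCounter(&gStart);
    p = malloc(size);
    free(p);
    QueryPerformanceCounter(&gEnd);
    /* QuadPart is a LONGLONG, so use %lld, not %g */
    printf("%d, %lld\n", i, gEnd.QuadPart - gStart.QuadPart);
    if (p == NULL)
        printf("ERROR size(%Iu)\n", size);  /* %Iu: MSVC's size_t specifier */
}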


That being said, operating systems often don't perform the actual heap allocation until the memory area is actually used. This means that there is probably no actual allocation taking place in your program.

To counter this behavior during benchmarking, you have to actually use the memory. For example, you could add something like this to ensure that the allocation is actually taking place:

p = malloc(size);
volatile int x = i;
((char *)p)[0] = (char)x;  /* p is a void* in the question's code, so cast before writing */
free(p);
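
If writing a single byte is not representative for large blocks, a variation that touches one byte in every page forces the OS to commit the whole range while remaining cheap enough to time. A sketch; the name q and the 4096-byte page size are assumptions (4 KB pages hold on x86/x64 Windows):

char *q = (char *)malloc(size);
if (q != NULL)
{
    /* Touch one byte per 4 KB page; the volatile write keeps the
       optimizer from eliding the loop as dead stores. */
    for (size_t ofs = 0; ofs < size; ofs += 4096)
        ((volatile char *)q)[ofs] = 1;
    free(q);
}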
Lundin
  • Just asking: Have you attempted to reproduce his error with correct code? – Dellowar Feb 10 '17 at 15:40
  • If size is larger than one byte, it might be wise to memset the entire buffer so the whole thing is brought into memory. – EvilTeach Feb 10 '17 at 16:15
  • @SanchkeDellowar No, I don't have a computer with two Windows partitions, which would be the requirement to properly benchmark any difference. – Lundin Feb 13 '17 at 07:24
  • @EvilTeach I'm not so sure. `memset` will get inlined by the optimizer, and from there it might decide that "hmm, you don't seem to be using all this memory anyway, let's skip the memset of the parts that aren't used". You could of course make the whole buffer `volatile`, but that would probably ruin the benchmarking. – Lundin Feb 13 '17 at 07:26
  • It would be a surer test to write a byte every 4 KB or so to ensure pages are mapped, and then compare the two OSes. Yet the reports of production software running slower, with real performance-analysis tools pointing to memory allocation, seem pretty convincing. – Gene Feb 14 '17 at 05:05
  • I'm awarding the bounty to this answer, even though I have a feeling that we're still not quite there yet. Thanks everyone for your help! – Jens Habegger Feb 14 '17 at 07:33
  • Thank you for your help. I've revised my source code as much as possible and updated it at the end of the original question. I also tried adding memset. However, the execution time for memset was very slow, so I could not isolate the execution time of malloc and free. – pleiades92 Feb 15 '17 at 10:03

Performance depends on many factors, such as OS, RAM, and CPU.

I ran this program in both windows 7 and windows 10 on same PC

A CPU computes things more quickly when the RAM can keep up with it. I suspect your CPU and RAM are good enough for Windows 7 (a lighter image than 10) but not enough for Windows 10. I suggest trying another system with Windows 10 on which the CPU can fetch instructions from RAM quickly enough and the RAM size matches the CPU, as on your Windows 7 setup, and make sure to close all applications running in the background.

leuage