
I'm porting C++ code from Linux to Windows. During this process, I found out that the following line runs about 10 times slower under Windows (on exactly the same hardware):

list<char*>* item = new list<char*>[160000]; 

On Windows it takes ~10ms, while on Linux it takes ~1ms. Note that this is the average time: running this line 100 times takes ~1 second on Windows.

This happens on both win32 and x64, both versions are compiled in Release, and the speed is measured via QueryPerformanceCounter (Windows) and gettimeofday (Linux).

The Linux compiler is gcc. The Windows compiler is VS2010.

Any idea why this could happen?
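For reference, the timing code itself isn't shown; below is a minimal sketch of what such a measurement might look like with the timers mentioned above (the structure and the helper name are illustrative, not the original code):

#include <cstdio>
#include <list>
#ifdef _WIN32
#include <windows.h>
#else
#include <sys/time.h>
#endif

// Current time in microseconds, using the timer named in the question
// for each platform (QueryPerformanceCounter / gettimeofday).
static double now_us()
{
#ifdef _WIN32
    LARGE_INTEGER freq, count;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&count);
    return count.QuadPart * 1000000.0 / freq.QuadPart;
#else
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1000000.0 + tv.tv_usec;
#endif
}

int main()
{
    double start = now_us();
    std::list<char*>* item = new std::list<char*>[160000];
    double elapsed = now_us() - start;
    std::printf("new list<char*>[160000] took %.0f us\n", elapsed);
    delete[] item;
    return 0;
}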

Doron Yaacoby
  • I see a flamewar rising... anyways, you should probably do stuff in a loop or so; these numbers probably don't mean that much, as 10ms is easily disturbed by other stuff going on. Measure in the range of a few seconds at least. But even then I would guess that the default way to allocate memory on Windows is slower. Just play around a bit and make it more like a pattern that you would use in your program. – PlasmaHH Jan 26 '12 at 15:37
  • @EdHeal: we fixed the time machine. You can now travel back to 1998 and live happily there. – Sedat Kapanoglu Jan 26 '12 at 15:42
  • @ssg not sure. On my side I tried my best to get a 5ms-resolution timer in Qt and found no way of doing it on Windows. – UmNyobe Jan 26 '12 at 15:47
  • @ssg - You need to do a lot of tuning on Windows to get it to work well - like going through the local services and turning a lot of them off! On Linux this process is a lot easier, and the services you do not use seem to disappear into the background and do not use CPU or memory. – Ed Heal Jan 26 '12 at 15:56
  • @UmNyobe: the fact that you couldn't find QueryPerformanceCounter does not mean Windows is "bloated". – Mooing Duck Jan 26 '12 at 15:57
  • 1) Why do you allocate 160k lists of char pointers? This seems to be ... curious? 2) Which compiler? (VC 2008 has a large performance problem with the STL, until you define something special ...) – Christopher Jan 26 '12 at 15:58
  • @EdHeal This is nothing to do with the OS. You appear to be saying that if Doron switched off a few services then `new` would be faster. Piffle say I! – David Heffernan Jan 26 '12 at 16:04
  • @Doron You really need to give full details of both compilers and how you are measuring. – David Heffernan Jan 26 '12 at 16:05
  • @david - Just know that when I switched off unused services on my PC, things worked faster - including allocating memory. Simply because the CPU has more cycles to do other things, including sorting out the memory allocation between processes and file IO. You just have a limited amount of memory/CPU cycles/file IO. If you are up against the limits or want better performance, just turn off things that you do not need. Most OSes (including Linux) turn on a lot of things that you can live without. Try it! – Ed Heal Jan 26 '12 at 16:24
  • @EdHeal Looking at Process Explorer on my machine, while it idles, I see no services consuming CPU cycles. Sorry, but I just don't buy it. – David Heffernan Jan 26 '12 at 16:36
  • @david - I get the occasional blip - Also, accessing the disk/memory in a continuous stream is more efficient. Well, it works for my PC (XP). – Ed Heal Jan 26 '12 at 16:56
  • @Christopher: It's not really char*, it is actually a custom class, but since this happens with char* as well, I figured it would make the question simpler. I will update the question with the compiler details. – Doron Yaacoby Jan 26 '12 at 18:42
  • @DoronYaacoby The VC10 STL list constructor allocates an (empty) node in the constructor. This could really be a big performance drawback in the 'construct' case. On the other hand, I don't know what the gcc STL does when it constructs a std::list; maybe it trades size for speed, and the VC one trades speed for size. I/you will have to take a look at the constructor(-chain) of the gcc version. Does this have a performance impact in your application? – Christopher Jan 26 '12 at 19:11
  • @DoronYaacoby After I looked at the libstdc++ implementation (of the version I found), it seems that the root node is not allocated from the heap. That means that the list construction in gcc is a bit faster than the VC version. I didn't look at implementation differences outside of this. Maybe this has another drawback in the implementation, so that the Dinkumware implementers chose to allocate the root node ... . But there we have it: allocating the root node 160k times seems to slow down your benchmark. – Christopher Jan 26 '12 at 19:23 (a sketch separating the two costs appears after these comments)
  • @Christopher, you were right indeed, the list<> ctor is much slower on Windows. It does make a difference for my application, will have to find a way to overcome this. Thanks! – Doron Yaacoby Jan 30 '12 at 07:16
  • Also, @Christopher: If you post this information as an answer, I think it will be a better one than the ones this question has. – Doron Yaacoby Jan 30 '12 at 07:24
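Building on Christopher's comments above, one way to see where the time goes is to separate the raw array allocation from the 160000 default constructions by constructing the lists into pre-allocated storage with placement new. This is only an illustrative sketch under that assumption, not code from the question:

#include <cstddef>
#include <cstdio>
#include <list>
#include <new>      // operator new[] / operator delete[], placement new

typedef std::list<char*> List;

int main()
{
    const std::size_t n = 160000;

    // Phase 1: raw allocation only - no list constructors run here.
    void* raw = operator new[](n * sizeof(List));
    List* lists = static_cast<List*>(raw);

    // Phase 2: run the default constructor n times via placement new.
    // If the implementation allocates a sentinel node per list (as the
    // VC10 one reportedly does), this phase does n heap allocations and
    // should dominate; an implementation that keeps the root node inside
    // the list object allocates nothing here.
    for (std::size_t i = 0; i < n; ++i)
        new (lists + i) List();

    // (Wrap each phase with the timer of your choice to compare them.)

    // Clean up: destroy the lists, then release the raw storage.
    for (std::size_t i = 0; i < n; ++i)
        lists[i].~List();
    operator delete[](raw);

    std::printf("done\n");
    return 0;
}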

2 Answers


It could be more an issue of library implementation. I would expect a single allocation in most cases, with the default constructor for list not allocating anything. So what you're trying to measure is the cost of the default constructor of list (which is executed 160000 times).

I say "trying to measure", because any measurements that small are measuring clock jitter and resolution more than they're measuring code execution times. You should put this in a loop, to execute it frequently enough to get a runtime of a couple of seconds. And when you do this, you need to take precautions to ensure that the compiler doesn't optimize anything out.

And under Linux, you want to measure using clock(), at least; the wall-clock time you get from gettimeofday depends heavily on whatever else happens to be running at the same time. (Don't use clock() under Windows, however. The Windows implementation is broken.)
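A minimal sketch of that kind of measurement under Linux, along the lines suggested above (the iteration count and the side-effect variable are just illustrative):

#include <cstddef>
#include <cstdio>
#include <ctime>
#include <list>

int main()
{
    // Repeat the allocation often enough that the total run takes seconds,
    // so that timer resolution and jitter stop mattering.
    const int iterations = 500;   // tune until the run lasts a few seconds
    std::size_t sink = 0;         // side effect so the work can't be optimized away

    std::clock_t start = std::clock();   // CPU time; fine on Linux per the answer
    for (int i = 0; i < iterations; ++i)
    {
        std::list<char*>* item = new std::list<char*>[160000];
        sink += item[0].size();          // touch the result
        delete[] item;                   // note: destruction is included in the timing
    }
    std::clock_t end = std::clock();

    double total = double(end - start) / CLOCKS_PER_SEC;
    std::printf("average per new[]/delete[] pair: %f ms (sink=%lu)\n",
                total * 1000.0 / iterations, (unsigned long)sink);
    return 0;
}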

James Kanze
  • I agree, this is measuring the cost of `std::list`'s constructor, not memory allocation. Further, it would be trivial to measure the two separately, using allocation followed by *placement-`new[]`*. – Ben Voigt Jan 26 '12 at 19:04
  • Turned out to be true, the list<> ctor is where most of the time went. – Doron Yaacoby Jan 30 '12 at 07:15

I think this statement takes very little time on both OSes (regardless of anything else). It takes so little time that you may actually be measuring the resolution of your timers.
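As a rough way to check that claim, one can measure the effective resolution of the timer itself, e.g. the smallest nonzero step gettimeofday ever reports (a Linux-only sketch; purely illustrative):

#include <cstdio>
#include <sys/time.h>

int main()
{
    timeval prev, cur;
    gettimeofday(&prev, 0);
    long best = 1000000000L;   // smallest nonzero step seen so far, in microseconds

    for (int i = 0; i < 1000000; ++i)
    {
        gettimeofday(&cur, 0);
        long delta = (cur.tv_sec - prev.tv_sec) * 1000000L
                   + (cur.tv_usec - prev.tv_usec);
        if (delta > 0 && delta < best)
            best = delta;
        prev = cur;
    }

    std::printf("smallest observed timer step: %ld us\n", best);
    return 0;
}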

UmNyobe