
I want to measure the performance of a block of code using QueryPerformanceCounter on Windows. What I would like to know is whether I can do something to get equal measurements for the same data between different runs (I want to measure the performance of different sorting algorithms on arrays of different sizes containing POD or some custom objects). I know that the current process can be interrupted during execution by interrupts or I/O operations. I'm not doing any I/O, so it's only interrupts that may affect my measurement. I'm also assuming the kernel gives my process a limited time slice, so the scheduler is going to preempt my process as well.

How do people make accurate measurements through measuring the time of execution of a specific piece of code?

ROMANIA_engineer
Razvan C
  • Do a million measurements, and get the average. Turning off preemption on a modern-day system is really not recommended. – Some programmer dude Mar 12 '17 at 12:49
  • Basically a good idea to turn off preemption. Did you find a way to achieve that? I'd say it's not possible, so you have to live with it. Measuring with preemption is quite realistic, because that's what your customer has. – Thomas Weller Mar 12 '17 at 12:50
  • Hi & thank you for the comments, no there's no way I know of disabling preemption, I'm actually hoping there's some sort of counter that only increments when my code is running. – Razvan C Mar 12 '17 at 13:02
  • If the code is short, you can filter out preemption overhead by ignoring outliers. If the code is long, it makes a statistically insignificant difference. So it's generally fine to run with preemption, except for "middle length", where preemption would add a "noticeable but hard to filter" overhead (in that case, just iterate more). – harold Mar 12 '17 at 13:05
  • What are you actually trying to measure and why? Interrupts are unavoidable overhead, if you want to measure the actual performance of your code (as opposed to what the performance might hypothetically be in a world where there were no interrupts) it seems to me that you need to include them. – Harry Johnston Mar 12 '17 at 22:14
  • Well actually I was curious about how some implementation could be measured accurately. If I'm testing quicksort now to see how fast it is on my machine and afterwards mergesort is running but at the same time in the background some torrent is going crazy with new connections, the measurement for mergesort might be slower but not because of the implementation or algorithm's speed. I was thinking maybe there would be some profiling help from some api that could actually only do counting when the task is in execution and when it yields the processor the timer stops until it gets it back again. – Razvan C Mar 12 '17 at 23:00
  • You could try setting the thread priority to THREAD_PRIORITY_TIME_CRITICAL while the code was being timed. In theory, no other threads would run while it was running. Interrupts would still be handled, but hopefully would all be very low overhead. – Ferruccio Mar 27 '17 at 19:33

2 Answers


I just went to a conference talk by Andrei Alexandrescu on Fastware, and he was addressing this exact issue: how to measure speed. Apparently taking the mean is a bad idea, BUT measuring many times is a great idea. So with that in mind, you measure a million times and keep the smallest measurement, because that's the run with the least amount of noise.

Means are awful because you're adding the noise's weight into the actual speed you're measuring (these are not the only things you should consider when evaluating code speed, but they're a good start; there's even more horrid stuff regarding where the code executes, and the overhead of starting execution on one core and finishing on another, but that's a different story and I don't think it applies to my sort).

A good joke was: if you put Bill Gates on a bus, on average everybody on that bus is a millionaire :))

Cheers and thanks to all who provided input.

Razvan C

Time measurements are tricky because you need to find out why your algorithm is slower. That depends on the input data (e.g. presorted data; see Why is it faster to process a sorted array than an unsorted array?) and on the data-set size (whether it fits into the L1/L2/L3 caches; see http://igoro.com/archive/gallery-of-processor-cache-effects/).

These effects can hugely influence your measured times. The order of measurements can also play a critical role: if you execute the sort algorithms in a loop and each of them allocates some memory, the first test will most likely lose. Not because the algorithm is inferior, but because the first time you access newly allocated memory it is soft-faulted into your process working set. After the memory is freed, the heap allocator returns pooled memory, which has entirely different access performance. This becomes very noticeable if you sort larger (many-MB) arrays.

Below are the touch times of a 2 GB array from different threads, for the first and the second touch. Each page (4 KB) of memory is touched only once per pass.

Threads Size_MB Time_ms us/Page MB/s    Scenario
1       2000    355     0.693   5634    Touch 1
1       2000    11      0.021   N.a.    Touch 2
2       2000    276     0.539   7246    Touch 1
2       2000    12      0.023   N.a.    Touch 2
3       2000    274     0.535   7299    Touch 1
3       2000    13      0.025   N.a.    Touch 2
4       2000    288     0.563   6944    Touch 1
4       2000    11      0.021   N.a.    Touch 2

// Touch is, from the compiler's point of view, a nop with no observable side
// effect. That is true from a pure data-content point of view, but performance-
// wise there is a huge difference. Turn optimizations off to prevent the
// compiler from outsmarting us and removing the loop.
#pragma optimize( "", off )
void Program::Touch(void *p, size_t N)
{
    char *pB = (char *)p;
    volatile char tmp;          // volatile so the read is never elided
    for (size_t i = 0; i < N; i += 4096)
    {
        tmp = pB[i];            // read one byte per 4 KB page
    }
}
#pragma optimize("", on)

To truly judge the performance of an algorithm, time measurements alone are not sufficient: you also need a profiler (e.g. the Windows Performance Toolkit, which is free, or Intel VTune, which is not) to ensure that you have measured the right thing and not something else entirely.

Alois Kraus