It's best not to confuse the goals of measuring and optimizing. They are different tasks. Measuring, while good for quantifying the results of fixing something, is poor at telling you what to fix.
On the subject of measuring, if I want to make a presentation of how fast something is, I want a controlled environment and a good stopwatch. However, for the purpose of optimizing, crude measurements, like running something 1000 times under a simple timer, are more than good enough.
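For instance, a rough loop like this (a minimal sketch in Python; `work()` is just a placeholder for whatever is being measured) is plenty accurate for before-and-after comparisons:

```python
import time

def work():
    # Placeholder for the code being measured.
    sum(i * i for i in range(10_000))

# Crude but sufficient: run it enough times that the total
# dwarfs timer resolution, then divide.
N = 1000
start = time.perf_counter()
for _ in range(N):
    work()
elapsed = time.perf_counter() - start
print(f"{elapsed / N * 1e6:.1f} microseconds per call")
```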
On the subject of optimizing, I am not concerned with speed.
I am concerned with needless activity.
It is not necessary to run the code at high speed to find it.
When any program runs, it traces out a call tree.
Optimizing consists of removing as many leaves (instructions) and as much fruit (I/O) as possible.
A good way to do this is to prune whole branches.
- In all but the smallest programs, the typical opportunities for optimization are call points (lines of code where functions are called, not the functions themselves) that, once you realize how much of the call tree sprouts from them, could really be done another way. A single innocent-looking line of code can be responsible for a large fraction of the entire tree, and you may be able to simply chop it off, as in the sketch below.
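As a hypothetical illustration (none of these names come from a real program), here is an innocent-looking line whose callee quietly re-sorts and re-serializes on every pass, so most of the call tree hangs off it:

```python
import random

class Index:
    """Hypothetical container that looks cheap to update but isn't."""
    def __init__(self):
        self.entries = []
        self.snapshot = ""

    def insert(self, item):
        self.entries.append(item)
        self.entries.sort()                  # re-sorts on every insert
        self.snapshot = repr(self.entries)   # re-serializes on every insert

def process(items, index):
    for item in items:
        index.insert(item)   # the innocent-looking line

def process_pruned(items, index):
    # Same result, whole branch chopped off: one sort, one snapshot.
    index.entries.extend(items)
    index.entries.sort()
    index.snapshot = repr(index.entries)

items = [random.random() for _ in range(2000)]
process(items, Index())         # most of the call tree hangs off one line
process_pruned(items, Index())  # same result in a fraction of the work
```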
To find those, I think wall-clock-time stack sampling is the best way. It does not need to be an efficient process, and a rather small number of samples works just as well as (or better than) a large number.
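To show how little machinery this takes, here is a sketch of an improvised sampler under CPython (`sys._current_frames()` is a CPython implementation detail; in native code, pausing the program in a debugger and reading the stack serves the same purpose):

```python
import sys
import threading
import time
import traceback

def sample_stacks(thread_id, n_samples=10, interval=0.5):
    # Take a handful of wall-clock stack samples of one thread.
    # Read them by eye: any line that shows up on several samples
    # is costing you roughly that fraction of total time.
    for i in range(n_samples):
        time.sleep(interval)
        frame = sys._current_frames()[thread_id]   # CPython-specific
        print(f"--- sample {i + 1} ---")
        traceback.print_stack(frame)

sampler = threading.Thread(
    target=sample_stacks,
    args=(threading.main_thread().ident,),
    daemon=True,
)
sampler.start()
# ... run the code under investigation here ...
```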
It is necessary to do it repeatedly, because any given program, as first written, doesn't contain just one opportunity for speedup. It contains several.
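One reason repetition pays: each fix shrinks the total time, so the remaining opportunities become larger fractions of it and easier to see in the next round of samples. A back-of-the-envelope illustration (the percentages here are made up):

```python
# Made-up costs: three removable activities at 50%, 25%, and 12.5%
# of the original run time, fixed one at a time.
time_remaining = 1.0
for cost in (0.50, 0.25, 0.125):
    time_remaining -= cost
    print(f"after removing {cost:.1%}: {1 / time_remaining:.1f}x total speedup")
# after removing 50.0%: 2.0x total speedup
# after removing 25.0%: 4.0x total speedup
# after removing 12.5%: 8.0x total speedup
```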
Here's an example.