I'm trying to do a typical "A/B testing" like approach on two different implementations of a real-life algorithm, using the same data-set in both cases. The algorithm is deterministic in terms of execution, so I really expect the results to be repeatable.
On the Core 2 Duo, this is also the case. Using just the linux "time" command I'll get variations in execution time around 0.1% (over 10 runs).
On the i7 I will get all sorts of variations, and I can easily have 30% variations up and down from the average. I assume this is due to the various CPU optimizations that the i7 does (dynamic overclocking etc), but it really makes it hard to do this kind of testing. Is there any other way to determine which of 2 algorithms is "best", any other sensible metrics I can use ?
Edit: The algorithm does not sustain for very long and this is actually the real-life scenario I'm trying to benchmark. So running repeatedly is not really an option as such.