> the larger loop was doing 2 * 10,000,000 operations, while each of the 2 smaller loops was doing 1 * 10,000,000 operations
It doesn't make sense to talk about "operations" without defining the machine model and cost model within which we model those "operations". Or, to put it simply: it doesn't make sense to count things until you are clear what you are counting.
In this case, you are counting additions. And you are correct: within your model, which counts only additions, both versions have the same number of additions.
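For reference, the two methods from the question presumably look roughly like this (a reconstruction, not the exact code; `N` stands in for the 10,000,000 iterations and the variable names are made up):

```ruby
N = 10_000_000

# Hypothetical reconstruction of the question's code, for reference only.
def one_loop
  N.times do
    a = 1 + 1
    b = 2 + 2
  end
end

def two_loops
  N.times do
    a = 1 + 1
  end

  N.times do
    b = 2 + 2
  end
end
```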
They do not, however, have the same number of block activations.
Remember, `Integer#times` looks roughly like this:
```ruby
class Integer
  def times
    return enum_for(__callee__) unless block_given?
    return self unless positive?

    i = -1
    yield i while (i += 1) < self
    self
  end
end
```
So, for each iteration of the loop, there is an activation (i.e. a `yield`) of the block passed to `Integer#times`.
If we add that as a new class of "operation", we have the following:
- `one_loop`: 20,000,000 additions and 10,000,000 block activations
- `two_loops`: 20,000,000 additions and 20,000,000 block activations
So, both methods have the same number of additions, but `two_loops` has twice the number of block activations.
Which means, we also have to take into account the relative costs of additions versus block activations. Now, semantically, an addition is just a normal method call. And activating a block is sort-of similar to a method call.
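Indeed, `1 + 1` is nothing more than syntactic sugar for a method call on `Integer`:

```ruby
# The operator form and the explicit method call are equivalent:
1 + 1          # => 2
1.send(:+, 1)  # => 2
```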
So, we would expect that an addition and a block activation have roughly similar cost, which means our cost would be:
- `one_loop`: 30,000,000 "method-call-like operations"
- `two_loops`: 40,000,000 "method-call-like operations"
In other words, we would expect `two_loops` to be 33% slower or `one_loop` to be 25% faster, depending on how you look at it.
However, we actually find that the difference is much bigger, so clearly we are missing something in our model.
What we are missing are optimizations. Arithmetic operations on integers are extremely common and extremely performance-critical, and so all Ruby implementations go to great lengths to make them fast. In fact, on all Ruby implementations, simple additions such as the ones you are using will be mapped directly to single CPU `ADD` instructions, and will not incur the overhead of a method call at all.
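If you want to see this for yourself on YARV, you can disassemble a simple addition; recent versions compile it to a specialized instruction (something like `opt_plus`) rather than a generic method dispatch, though the exact output depends on the version:

```ruby
# YARV only: print the bytecode for a simple addition.
puts RubyVM::InstructionSequence.compile("1 + 1").disasm
```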
Block activations are also very important in Ruby, and thus they are also heavily optimized, but they are just fundamentally several orders of magnitude more complex than adding two machine-word integers.
In fact, the relative complexity of block activations to machine-word integer additions is so big that we can actually ignore the additions completely in our model:
- `one_loop`: 10,000,000 block activations
- `two_loops`: 20,000,000 block activations
This gives us a factor of 2:1, so we would expect `two_loops` to be 100% slower or `one_loop` to be 50% faster.
By the way, I ignored another operation that is happening: the definition and initialization of the local variables. The argument is similar: that is an operation that is so fast that it is negligible compared to the block activation.
Actually, so far, we have only talked about the relative cost of those operations and how they mean that we can ignore the cost of the additions and the local variables. There is, however, an even stronger reason to ignore those: optimizations.
Even the simplest Ruby implementation will be able to optimize away the local variables completely: they are defined and initialized in only one place, and they are never accessed again. They only exist within the scope of the block, for the duration of one activation of the block, so even a very simple optimizer can see that they are completely useless and will optimize the code to something roughly like this:
```ruby
def one_loop
  N.times do
    1 + 1
    2 + 2
  end
end

def two_loops
  N.times do
    1 + 1
  end

  N.times do
    2 + 2
  end
end
```
Meaning that we can not only ignore the cost of the local variables because it is small compared to the other costs, but actually, the local variables don't even exist.
Likewise, a slightly smarter optimizer will recognize that the first addition in `one_loop` has no side-effects, is not returned, is not stored in a variable (or at least not in one that is used anywhere), and in general does not influence the outcome of the computation in any way, shape, or form, and thus will optimize the code to this:
```ruby
def one_loop
  N.times do
    2 + 2
  end
end

def two_loops
  N.times do
    1 + 1
  end

  N.times do
    2 + 2
  end
end
```
Furthermore, the same argument actually applies to the remaining addition. It has no side-effects; all it does is get returned from the block, but `Integer#times` ignores the return value of the block. I haven't taken a look at the generated code, but I strongly suspect that even the most stupid optimizer can easily prove that your blocks are no-ops, and thus it will optimize the code to something roughly like this:
```ruby
def one_loop
  N.times do
  end
end

def two_loops
  N.times do
  end

  N.times do
  end
end
```
Which means that `one_loop` has `N` iterations of the block, `two_loops` has `2 * N` iterations, and thus should take roughly twice as long.
Now, we can see in your benchmarks that the numbers are not actually 2:1. They are 1.75:1 or roughly 7:4.
I can confirm those results on my machine, here with YARV 2.7.1 without JIT, I get almost exactly 7:4:
```
                        user     system      total        real
two smaller loops   0.711479   0.000099   0.711578 (  0.711680)
one large loop      0.401808   0.000059   0.401867 (  0.401916)
```
However, when I turn on the JIT, I get almost exactly the 2:1 we would expect:
```
                        user     system      total        real
two smaller loops   0.587017   0.000279   0.587296 (  0.587098)
one large loop      0.291713   0.000062   0.291775 (  0.291779)
```
You will also notice that it is faster in general.
With JRuby 9.2.9.0, we get again slightly faster execution and almost 2:1:
```
                        user     system      total        real
two smaller loops   0.740000   0.010000   0.750000 (  0.517670)
one large loop      0.260000   0.000000   0.260000 (  0.263270)
```
This was with default options, here are the results with some more aggressive compiler flags:
```
                        user     system      total        real
two smaller loops   0.370000   0.000000   0.370000 (  0.362050)
one large loop      0.390000   0.010000   0.400000 (  0.213861)
```
TruffleRuby 20.1.0 is again much faster than even JRuby:
```
                        user     system      total        real
two smaller loops   0.009955   0.000039   0.009994 (  0.010035)
one large loop      0.004759   0.000007   0.004766 (  0.004742)
```
And again, very close to 2:1. Also, even though we are only interested in the relative performance of the two methods, it is quite pleasing to see that TruffleRuby is 70x-100x faster than YARV on this benchmark!
Actually, I am somewhat surprised that TruffleRuby could not prove that `Integer#times` with an empty block body is a no-op. I would have expected it to be able to optimize the code like this:
```ruby
def one_loop
end

def two_loops
end
```
And thus have no runtime difference between the two versions at all.
> Am I missing something? Are the 2 smaller loops really doing a much greater amount of work than the single larger loop? Or did I somehow construct my benchmark script in a misleading or inaccurate way?
I would say all of the above.
The main problem is that you are measuring almost the exact opposite of what you are counting. You are counting only additions and ignoring block activations, and there is nothing wrong with that, IFF all you are interested in is the number of additions and nothing else.
And you are measuring only the cost of block activations and ignoring the cost of additions, which is also totally fine if that is what you are interested in.
The problem is that those two do not match up: you are not measuring what you are counting and you are not counting what you are measuring, so you simply cannot draw any conclusions from the results of your experiments towards your hypothesis.
In one of your comments, you asked:
> so does this mean that each iteration of each loop counts as its own operation, in addition to whatever operation(s) happen inside the loop?
We can't tell you that. You need to define what operations you are interested in, and what operations you want to ignore. If you define "operation" to only mean "addition", then no, each iteration of the loop does not count as its own operation, and both of your examples have the exact same number of operations.
Another problem is that your hypothesis "the number of additions is the same, therefore the execution time is the same" is invalid, because additions are not the only operations that take time. And even if you count other kinds of operations, then your hypothesis still assumes that every operation takes the same amount of time, which is also not true.
There are also some more problems with your benchmarking approach in general, which however are not the source of your confusion. Here are some of the problems with your benchmarks I have identified, although I am sure there are others:
- Your benchmarks are written in a way that all the operations you are interested in are optimized away, leaving only operations you are not interested in.
- Even if they weren't optimized away, their execution time is negligible compared to the execution time of the operations you don't care about.
- Your benchmarks don't run long and often enough to give the optimizer a chance. For example, the default threshold for when a method is compiled is between 20 and 20000 calls depending on the Ruby implementation, compiler flags, etc. Both your methods are only called twice, once during rehearsal, once during the real thing. You would need to make sure they are called a lot more than 20000 times to ensure that a) they get compiled at all, and b) there are enough iterations after they have been compiled that the slower iterations before they were compiled do not significantly affect the outcome.
I always recommend that someone who wants to write a benchmark read and understand the following mailing list thread:
JMH vs Caliper: reference thread
Especially the sub-thread and following discussion starting at the linked message.
While this thread is about a specific benchmark tool for benchmarking Java code, anything that is discussed in the thread applies to all benchmarking on all modern high-performance language implementations.
There is a reason why benchmarks are written by benchmark engineers who have writing benchmarks as their full-time job: it takes a lot of knowledge and expertise to write benchmarks.
You need, at least:
- Deep knowledge of computer organization in general.
- Deep knowledge of all the specific hardware platforms that you are benchmarking on.
- Deep knowledge of programming languages in general.
- Deep knowledge of all the specific programming languages you are writing benchmark code in.
- Deep knowledge of language implementations in general, including but not limited to ahead-of-time compilers, JIT compilers, interpreters, VMs, garbage collectors, memory allocators, optimizers, inlining, loop unrolling, dead code elimination, constant folding, common subexpression elimination, tail call elimination, peephole optimizations, compile-time evaluation, Polymorphic Inline Caching, and lots more.
- Deep knowledge of the specific language implementations you are running your code on.
- And more, e.g. operating systems, scheduling, NUMA, multi-threading, SMT, CMT, …
And when you have all of that, you are presented with a bunch of numbers you need to know how to interpret, which requires deep knowledge of statistics.
The `benchmark` library in the Ruby stdlib is an example of many of those "sins". It was perfectly fine 25 years ago, when there was only one single Ruby implementation which was simply a dumb AST-walking interpreter with no optimizations whatsoever, and computers had one CPU with no out-of-order execution or speculation, and that was it. But for current times, where we have a multitude of aggressively optimizing Ruby implementations (most prominently TruffleRuby) and complex CPUs that perform their own optimizations as well, it just doesn't cut it anymore.
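For illustration, the kind of script the question presumably used (the "rehearsal" pass mentioned above is what `Benchmark.bmbm` does) looks roughly like this; this is a sketch, not the original script:

```ruby
require 'benchmark'

# Sketch only: assumes one_loop and two_loops are defined as above.
Benchmark.bmbm do |x|
  x.report("two smaller loops") { two_loops }
  x.report("one large loop")    { one_loop }
end
```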
Unfortunately, there is no benchmark tool of comparable sophistication to what's available e.g. in the Java world, but there are at least some alternatives such as `better-benchmark` (no longer maintained), `benchmark-ips` (by Evan Phoenix, founder of Rubinius), or `fruity` (by Marc-André Lafortune, ruby-core team member).
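As a sketch of what that looks like in practice, here is how the two methods could be compared with `benchmark-ips`; the warmup and measurement times below are arbitrary example values:

```ruby
require 'benchmark/ips'

# Sketch only: assumes one_loop and two_loops are defined as above.
Benchmark.ips do |x|
  x.config(warmup: 5, time: 10)   # long enough to give a JIT a chance to compile
  x.report("one large loop")    { one_loop }
  x.report("two smaller loops") { two_loops }
  x.compare!                      # prints the relative speed of the two reports
end
```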