Ruby v2.3.3p222
MacOS Catalina v10.15.3
Processor: 2.6GHz 6-Core Intel Core i7

I have the following performance benchmark script, which was designed to test the difference between one larger loop operation vs. two smaller loops:

require 'benchmark'

N = 10_000_000

def one_loop
  N.times do
    foo = 1+1
    bar = 2+2
  end
end

def two_loops
  N.times do
    foo = 1+1
  end

  N.times do
    bar = 2+2
  end
end


Benchmark.bmbm do |performance|
  performance.report("two smaller loops") { two_loops }
  performance.report("one large loop") { one_loop }
end

My hypothesis was that these two methods would execute in approximately the same amount of time, because (I thought) they were both executing the same number of instructions: the larger loop was doing 2 * 10,000,000 operations, while each of the 2 smaller loops was doing 1 * 10,000,000 operations.

However, that doesn't appear to be what I observed. When I run the script, I get the following output:

Rehearsal -----------------------------------------------------
two smaller loops   0.840000   0.000000   0.840000 (  0.838101)
one large loop      0.500000   0.010000   0.510000 (  0.506283)
-------------------------------------------- total: 1.350000sec

                        user     system      total        real
two smaller loops   0.850000   0.000000   0.850000 (  0.863052)
one large loop      0.500000   0.000000   0.500000 (  0.494525)

This is really disappointing, because I was hoping to convince my team that we wouldn't see any reduced performance by splitting up our 1 large loop of code into several more concise loops which each did one thing and did it well.

I thought it might be due to the order in which the reports are generated, but when I reverse the order of the two calls to performance.report, I get similarly disappointing results:

Rehearsal -----------------------------------------------------
one large loop      0.500000   0.010000   0.510000 (  0.508246)
two smaller loops   0.850000   0.000000   0.850000 (  0.852467)
-------------------------------------------- total: 1.360000sec

                        user     system      total        real
one large loop      0.490000   0.000000   0.490000 (  0.496130)
two smaller loops   0.830000   0.000000   0.830000 (  0.831476)

Am I missing something? Are the 2 smaller loops really doing a much greater amount of work than the single larger loop? Or did I somehow construct my benchmark script in a misleading or inaccurate way?

Richie Thomas
  • In Big O these are both considered N time because we drop the constant (The "large" loop is 2N and the smaller is N). I would say it's most likely still worth splitting up - writing your code as huge expressions for the sake of performance sounds kinda miserable. And you gotta ask yourself whether 0.4 seconds over 10 million iterations really matters (especially considering that whatever actually happens in your loop is likely more time consuming than 2 + 2) – max pleaner Jul 16 '20 at 03:21

2 Answers


the larger loop was doing 2 * 10,000,000 operations, while each of the 2 smaller loops was doing 1 * 10,000,000 operations

It doesn't make sense to talk about "operations" without defining the machine model and cost model within which we model those "operations". Or, to put it simply: it doesn't make sense to count things until you are clear what you are counting.

In this case, you are counting additions. And you are correct: within your model, which counts only additions, both versions have the same number of additions.

They do not, however, have the same number of block activations.

Remember, Integer#times looks roughly like this:

class Integer
  def times
    return enum_for(__callee__) unless block_given?
    return self unless positive?

    i = -1
    yield i while (i += 1) < self

    self
  end
end

So, for each iteration of the loop, there is an activation (i.e. a yield) of the block passed to Integer#times.

If we add that as a new class of "operation", we have the following:

  • one_loop: 20,000,000 additions and 10,000,000 block activations
  • two_loops: 20,000,000 additions and 20,000,000 block activations

So, both methods have the same number of additions but two_loops has twice the number of block activations.
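
If you want to see that difference without timing anything, you can make the block activations explicit with a counter. Here is a minimal sketch; the counter and the smaller n are my additions, not part of the original benchmark:

n = 1_000_000

activations = 0
n.times { activations += 1; 1 + 1; 2 + 2 }           # one loop
puts "one loop:  #{activations} block activations"   # prints n

activations = 0
n.times { activations += 1; 1 + 1 }                  # two loops
n.times { activations += 1; 2 + 2 }
puts "two loops: #{activations} block activations"   # prints 2 * n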

This means we also have to take into account the relative cost of additions versus block activations. Now, semantically, an addition is just a normal method call, and activating a block is sort-of similar to a method call.

So, we would expect that an addition and a block activation have roughly similar cost, which means our cost would be:

  • one_loop: 30,000,000 "method call like operations"
  • two_loops: 40,000,000 "method call like operations"

In other words, we would expect two_loops to be 33% slower or one_loop to be 25% faster, depending on how you look at it.

However, we actually find that the difference is much bigger, so clearly we are missing something in our model.

What we are missing are optimizations. Arithmetic operations on integers are extremely common and extremely performance-critical, and so all Ruby implementations go to great lengths to make them fast. In fact, on all Ruby implementations, simple additions such as the one you are using will be mapped directly to single CPU ADD instructions, and will not incur the overhead of a method call at all.
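
On YARV you can get a rough sense of this by disassembling the code with RubyVM::InstructionSequence. I would expect the addition to show up as a specialized instruction (opt_plus) rather than a generic method dispatch, though the exact listing depends on your Ruby version and compile options:

# Print the YARV bytecode for the loop and its block; the additions should
# appear as specialized instructions rather than full method calls
# (the exact output varies by Ruby version).
puts RubyVM::InstructionSequence.compile(<<~RUBY).disasm
  10_000_000.times do
    foo = 1 + 1
    bar = 2 + 2
  end
RUBY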

Block activations are also very important in Ruby, and thus they are also heavily optimized, but they are just fundamentally several orders of magnitude more complex than adding two machine-word integers.

In fact, the relative complexity of block activations to machine-word integer additions is so big, that we can actually ignore the additions completely in our model:

  • one_loop: 10,000,000 block activations
  • two_loops: 20,000,000 block activations

This gives us a factor of 2:1, so we would expect two_loops to be 100% slower or one_loop to be 50% faster.

By the way, I ignored another operation that is happening: the definition and initialization of the local variables. The argument is similar: that is an operation that is so fast that it is negligible compared to the block activation.

Actually, so far, we have only talked about the relative cost of those operations and how they mean that we can ignore the cost of the additions and the local variables. There is, however, an even stronger reason to ignore those: optimizations.

Even the simplest Ruby implementation can optimize away the local variables completely: they are defined and initialized in only one place, and they are never accessed again. They exist only within the scope of the block, for the duration of one activation of the block, so even a very simple optimizer can see that they are completely useless and will rewrite the code to something roughly like this:

def one_loop
  N.times do
    1+1
    2+2
  end
end

def two_loops
  N.times do
    1+1
  end

  N.times do
    2+2
  end
end

This means we can not only ignore the cost of the local variables because it is small compared to the other costs; after this optimization, the local variables don't even exist.

Likewise, a slightly smarter optimizer will recognize that the first addition in one_loop has no side-effects, is not returned, is not stored in a variable (or at least not in one that is used anywhere), and in general does not influence the outcome of the computation in any way, shape, or form, and thus will optimize the code to this:

def one_loop
  N.times do
    2+2
  end
end

def two_loops
  N.times do
    1+1
  end

  N.times do
    2+2
  end
end

Furthermore, the same argument actually applies to the remaining addition: it has no side effects, its result is merely returned from the block, and Integer#times ignores the block's return value. I haven't taken a look at the generated code, but I strongly suspect that even a very simple optimizer can easily prove that your blocks are no-ops, and thus will optimize the code to something roughly like this:

def one_loop
  N.times do
  end
end

def two_loops
  N.times do
  end

  N.times do
  end
end

Which means that one_loop has N iterations of the block, two_loops has 2 * N iterations, and thus should take roughly twice as long.
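
You can check that prediction in isolation by benchmarking literally empty blocks. A minimal sketch along the lines of the original script (absolute numbers will of course vary by machine and Ruby implementation):

require 'benchmark'

N = 10_000_000

Benchmark.bmbm do |x|
  # Only the block activations remain; no additions, no local variables.
  x.report("one empty loop")  { N.times { } }
  x.report("two empty loops") { N.times { }; N.times { } }
end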

Now, we can see in your benchmarks that the numbers are not actually 2:1. They are roughly 1.75:1, i.e. about 7:4.

I can confirm those results on my machine. Here, with YARV 2.7.1 without JIT, I get almost exactly 7:4:

                       user     system      total        real
two smaller loops   0.711479   0.000099   0.711578 (  0.711680)
one large loop      0.401808   0.000059   0.401867 (  0.401916)

However, when I turn on the JIT, I get almost exactly the 2:1 we would expect:

                       user     system      total        real
two smaller loops   0.587017   0.000279   0.587296 (  0.587098)
one large loop      0.291713   0.000062   0.291775 (  0.291779)

You will also notice that it is faster in general.

With JRuby 9.2.9.0, we get again slightly faster execution and almost 2:1:

                       user     system      total        real
two smaller loops   0.740000   0.010000   0.750000 (  0.517670)
one large loop      0.260000   0.000000   0.260000 (  0.263270)

This was with default options, here are the results with some more aggressive compiler flags:

                       user     system      total        real
two smaller loops   0.370000   0.000000   0.370000 (  0.362050)
one large loop      0.390000   0.010000   0.400000 (  0.213861)

TruffleRuby 20.1.0 is again much faster than even JRuby:

                       user     system      total        real
two smaller loops   0.009955   0.000039   0.009994 (  0.010035)
one large loop      0.004759   0.000007   0.004766 (  0.004742)

And again, very close to 2:1. Also, even though we are only interested in the relative performance of the two methods, it is quite pleasing to see that TruffleRuby is 70x-100x faster than YARV on this benchmark!

Actually, I am somewhat surprised that TruffleRuby could not prove that Integer#times with an empty block body is a no-op. I would have expected it to be able to optimize the code like this:

def one_loop
end

def two_loops
end

And thus have no runtime difference between the two versions at all.

Am I missing something? Are the 2 smaller loops really doing a much greater amount of work than the single larger loop? Or did I somehow construct my benchmark script in a misleading or inaccurate way?

I would say all of the above.

The main problem is that you are measuring almost the exact opposite of what you are counting. You are counting only additions and ignoring block activations, and there is nothing wrong with that, IFF all you are interested in is the number of additions and nothing else.

And you are measuring only the cost of block activations and ignoring the cost of additions, which is also totally fine if that is what you are interested in.

The problem is that those two do not match up: you are not measuring what you are counting and you are not counting what you are measuring, so you simply cannot draw any conclusions from the results of your experiments towards your hypothesis.

In one of your comments, you asked:

so does this mean that each iteration of each loop counts as its own operation, in addition to whatever operation(s) happen inside the loop?

We can't tell you that. You need to define what operations you are interested in, and what operations you want to ignore. If you define "operation" to only mean "addition", then no, each iteration of the loop does not count as its own operation, and both of your examples have the exact same amount of operations.

Another problem is that your hypothesis "the number of additions is the same, therefore the execution time is the same" is invalid, because additions are not the only operations that take time. And even if you count other kinds of operations, then your hypothesis still assumes that every operation takes the same amount of time, which is also not true.
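
One way to convince yourself of this is to time the block activations with and without the additions. A rough sketch; I would expect the two reports to come out fairly close on YARV, because the block activation dominates, but I have not tuned this in any way:

require 'benchmark'

N = 10_000_000

Benchmark.bmbm do |x|
  # Same number of block activations in both reports; only the second
  # one pays for the additions.
  x.report("empty block")          { N.times { } }
  x.report("block with additions") { N.times { 1 + 1; 2 + 2 } }
end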

There are also some more problems with your benchmarking approach in general, which, however, are not the source of your confusion. Here are some of the problems with your benchmarks that I have identified, although I am sure there are others:

  • Your benchmarks are written in a way that all the operations you are interested in are optimized away, leaving only operations you are not interested in.
  • Even if they weren't optimized away, their execution time is negligible compared to the execution time of the operations you don't care about.
  • Your benchmarks don't run long or often enough to give the optimizer a chance. For example, the default threshold for when a method gets compiled is between 20 and 20000 calls, depending on the Ruby implementation, compiler flags, etc. Both of your methods are only called twice, once during the rehearsal and once during the real run. You would need to make sure they are called a lot more than 20000 times, to ensure that a) they get compiled at all, and b) there are enough iterations after they have been compiled that the slower iterations before compilation do not significantly affect the outcome; see the warm-up sketch after this list.
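
Here is a minimal warm-up sketch for that last point. The per-call work is shrunk so the methods can be called many times, and the 50_000 call count is an arbitrary number I picked for illustration, not a documented compilation threshold:

require 'benchmark'

N = 1_000  # much smaller per-call work so the methods can be called often

def one_loop
  N.times { 1 + 1; 2 + 2 }
end

def two_loops
  N.times { 1 + 1 }
  N.times { 2 + 2 }
end

# Warm-up: call both methods well past typical compilation thresholds
# before any timing happens.
50_000.times { one_loop; two_loops }

Benchmark.bmbm do |x|
  x.report("one large loop")    { 50_000.times { one_loop } }
  x.report("two smaller loops") { 50_000.times { two_loops } }
end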

I always recommend that someone who wants to write a benchmark read and understand the following mailing list thread:

JMH vs Caliper: reference thread

Especially the sub-thread and following discussion starting at the linked message.

While this thread is about a specific benchmark tool for benchmarking Java code, anything that is discussed in the thread applies to all benchmarking on all modern high-performance language implementations.

There is a reason why benchmarks are written by benchmark engineers who have writing benchmarks as their full-time job: it takes a lot of knowledge and expertise to write benchmarks.

You need, at least:

  • Deep knowledge of computer organization in general.
  • Deep knowledge of all the specific hardware platforms that you are benchmarking on.
  • Deep knowledge of programming languages in general.
  • Deep knowledge of all the specific programming languages you are writing benchmark code in.
  • Deep knowledge of language implementations in general, including but not limited to ahead-of-time compilers, JIT compilers, interpreters, VMs, garbage collectors, memory allocators, optimizers, inlining, loop unrolling, dead code elimination, constant folding, common subexpression elimination, tail call elimination, peephole optimizations, compile-time evaluation, Polymorphic Inline Caching, and lots more.
  • Deep knowledge of the specific language implementations you are running your code on.
  • And more, e.g. operating systems, scheduling, NUMA, multi-threading, SMT, CMT, …

And when you have all of that, you are presented with a bunch of numbers you need to know how to interpret, which requires deep knowledge of statistics.

The benchmark library in the Ruby stdlib is an example of many of those "sins". It was perfectly fine 25 years ago, when there was only one single Ruby implementation which was simply a dumb AST-walking interpreter with no optimizations whatsoever, and computers had one CPU with no out-of-order execution or speculation, and that was it. But for current times, where we have a multitude of aggressively optimizing Ruby implementations (most prominently TruffleRuby) and complex CPUs that perform their own optimizations as well, it just doesn't cut it anymore.

Unfortunately, there is no benchmark tool of comparable sophistication to what's available e.g. in the Java world, but there are at least some alternatives such as better-benchmark (no longer maintained), benchmark-ips (by Evan Phoenix, founder of Rubinius), or fruity (by Marc-André Lafortune, ruby-core team member).
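
For example, with benchmark-ips the comparison from the question might look roughly like this (a sketch; the gem handles warm-up for you and reports iterations per second with an error margin):

# gem install benchmark-ips
require 'benchmark/ips'

N = 10_000_000

# Definitions reproduced from the question, minus the unused local variables.
def one_loop
  N.times { 1 + 1; 2 + 2 }
end

def two_loops
  N.times { 1 + 1 }
  N.times { 2 + 2 }
end

Benchmark.ips do |x|
  x.report("one large loop")    { one_loop }
  x.report("two smaller loops") { two_loops }
  x.compare!  # prints how many times faster the winner is
end

The x.compare! call prints the ratio between the two reports directly, which is exactly the number this whole question is about.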

Jörg W Mittag
  • TL:DR of some of this: micro-benchmarking is *hard*, especially anything involving an optimizing compiler rather than a simple interpreter, or things small enough that out-of-order execution by the underlying CPU becomes relevant. (Throughput and latency are different). But yeah, warm-up effects like dynamic CPU frequency, and page faults the first time you touch newly allocated memory, can be major factors, too. [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) And other warm-up effects for JIT-compiled languages like most JVMs and some Ruby. – Peter Cordes Jul 18 '20 at 23:43

This is 10 million iterations; counting each iteration as one operation plus the two calculations done inside it, that's a total of 30 million of what we'll call operations:

  N.times do
    foo = 1+1
    bar = 2+2
  end

This is 20 million iterations in total; counting each iteration plus the single calculation done inside it, that's a total of 40 million of what we'll call operations:

  N.times do
    foo = 1+1
  end

  N.times do
    bar = 2+2
  end

30 < 40, therefore the first example is faster.
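
Spelled out as plain arithmetic (this is just the counting above, not measured data):

N = 10_000_000

# one loop: each of N iterations = 1 iteration step + 2 calculations
one_loop_ops  = N * (1 + 2)      # 30_000_000

# two loops: each of 2 * N iterations = 1 iteration step + 1 calculation
two_loops_ops = 2 * N * (1 + 1)  # 40_000_000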

anothermh
  • @Richie Thomas A loop includes decrementing a counter and testing it for > 0, so I'd say it counts for at least 2. Your problem is that in your example there's more overhead in maintaining the loop than there's work going on inside it. If you broke up a single loop with 100 operations into 2 with 50, the overhead would be negligible – Jim Castro Jul 16 '20 at 01:53
  • `This is 10 million iterations, and in each iteration two calculations are done, for a total of 30 million...` so does this mean that each iteration of each loop counts as its own operation, in addition to whatever operation(s) happen inside the loop? I was under the impression that this wasn't the case, at least when it pertains to compiling code down to assembly instructions. I'm sure it also depends on my architecture, which is why I included my hardware info and Ruby version. EDIT- deleted the above comment which used to come before @JimCastro's comment. – Richie Thomas Jul 16 '20 at 01:53
  • @JimCastro in that case, my example code doesn't match my real-world example. I'll re-benchmark with more content inside the two loops. Thanks! – Richie Thomas Jul 16 '20 at 01:55