
I wrote a simple benchmark in order to find out whether the bounds check can be eliminated when the array index gets computed via bitwise and. This is basically what nearly all hash tables do: they compute

h & (table.length - 1)

as an index into the table, where h is the hashCode or a derived value. The results show that the bounds check doesn't get eliminated.
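
For context, here's a minimal sketch of that idiom as power-of-two-sized hash tables typically use it (the method name and the bit-spreading step are my own illustration, loosely modeled on HashMap-like tables, not copied from any particular implementation):

// Sketch of the common hash-table indexing idiom (illustrative only).
// A power-of-two length gives a uniform bucket distribution, but the
// resulting index is in range for ANY non-empty table.
static int bucketFor(int hashCode, Object[] table) {
    int h = hashCode ^ (hashCode >>> 16); // spread the high bits into the low ones
    return h & (table.length - 1);        // in [0, table.length) whenever length > 0
}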

The idea of my benchmark is pretty simple: Compute two values i and j, where both are guaranteed to be valid array indexes.

  • i is the loop counter. When it gets used as array index, the bounds check gets eliminated.
  • j gets computed as x & (table.length - 1), where x is some value changing on each iteration. When it gets used as array index, the bounds check does not get eliminated.

The relevant part is as follows:

for (int i=0; i<=table.length-1; ++i) {
    x += result;
    final int j = x & (table.length-1);
    result ^= i + table[j];
}

The other experiment uses

    result ^= table[i] + j;

instead. The difference in timing is maybe 15% (pretty consistently across different variants I've tried). My questions:

  • Are there other possible reasons for this besides bounds check elimination?
  • Is there some complicated reason I can't see why there's no bounds check elimination for j?

A summary of the answers

Marko Topolnik's answer shows that it's all more complicated and that eliminating the bounds check is not guaranteed to be a win; in particular, on his computer the "normal" code is slower than the "masked" one. I guess this is because it allows some additional optimization which turns out to be actually detrimental in this case (given the complexity of current CPUs, the compiler hardly ever knows for sure).

leventov's answer shows clearly that the array bounds check gets done in "masked" and that its elimination makes the code as fast as "normal".

Donal Fellows points to the fact that the masking doesn't work for a zero-length table, as x & (0-1) equals x. So the best the compiler can do is to replace the bounds check by a zero-length check. But this is IMHO still worth it, as the zero-length check can easily be moved out of the loop.
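
A tiny demonstration of the zero-length case (plain Java, just to illustrate his point):

// For a zero-length table the mask (0 - 1) is -1, i.e. all bits set,
// so the masking does nothing and an out-of-range x passes through:
int x = 42;
System.out.println(x & (0 - 1));  // prints 42 -- not a valid index
System.out.println(x & (16 - 1)); // prints 10 -- valid for a 16-entry table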

Proposed optimization

Because a[x & (a.length - 1)] throws if and only if a.length == 0, the compiler can do the following:

  • For each array access, check if the index has been computed via a bitwise and.
  • If so, check if either of the operands was computed as length minus one.
  • If so, replace the bounds check by a zero-length check.
  • Let the existing optimizations take care of it.

Such an optimization should be pretty simple and cheap as it only looks at the parent nodes in the SSA graph. Unlike many complex optimizations, it can never be detrimental, as it only replaces one check by a slightly simpler one; so there's no problem, not even if it can't be moved out of the loop.
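
In pseudocode over a hypothetical SSA node API (Node, Op and the accessors below are invented names for illustration; HotSpot's actual IR classes differ), the pattern match could look roughly like this:

// Hypothetical sketch of the proposed pattern match, not HotSpot's real API.
boolean canReplaceByZeroLengthCheck(Node index, Node array) {
    if (index.op() != Op.AND) return false;            // index = a & b?
    for (Node operand : index.inputs()) {              // either operand may be the mask
        if (operand.op() == Op.SUB                     // operand = length - 1?
                && operand.input(0).op() == Op.ARRAY_LENGTH
                && operand.input(0).input(0) == array  // length of the SAME array
                && operand.input(1).isConstant(1)) {
            return true; // array[index] throws iff array.length == 0
        }
    }
    return false;
}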

I'll post this to the hotspot-dev mailing list.

News

John Rose filed an RFE and there's already a "quick-and-dirty" patch.

maaartinus
  • Was this the famous serial performance-question downvoter, or someone willing to drop a note? – maaartinus Feb 11 '14 at 13:25
  • I see a possible reason: `table[i]` results in a sequential access pattern, whereas with `table[j]` it's more irregular. Just one or two cache misses could be enough to account for the 15% difference. – Marko Topolnik Feb 11 '14 at 13:28
  • @MarkoTopolnik: I forgot to mention that the whole table surely fits in L1 (1024 entries only). – maaartinus Feb 11 '14 at 13:31
  • There may be other low-level optimizations for sequential access, like fetching the next value before the pipeline reaches the stage where the actual index is known. – Marko Topolnik Feb 11 '14 at 13:32
  • You have confirmed the elimination of the bounds check? – Marko Topolnik Feb 11 '14 at 13:35
  • How can you be sure the bounds check doesn't get eliminated when you don't compute the index with `& table.length - 1`? – SpaceTrucker Feb 11 '14 at 13:37
  • @MarkoTopolnik: "fetching the next value before the pipeline reaches" - sounds interesting. "confirmed the elimination" - sort of... the output of `PrintAssembly` is pretty unreadable, but I can see `cmp %r13d,%edx, jae ...` present in `timeMasked` only. – maaartinus Feb 11 '14 at 13:45
  • Apart from the eliminated bounds check, I see a more radical optimization in the "normal index" case. The loop is unrolled in both cases, but in the normal case it is also staged: first the values are bulk-loaded into various registers, then the registers are xor-ed into the accumulator. – Marko Topolnik Feb 11 '14 at 14:01
  • BTW The option `-XX:CompileCommand=print,*Benchmark.time*`, apart from filtering out everything you're not interested in, gives a nicer printout (doesn't show placeholders for actual register names, for one). – Marko Topolnik Feb 11 '14 at 14:06
  • To my surprise, I actually get better speed for the *masked* case! I see an 8% advantage on it (measuring for 10 seconds after 5 seconds warmup). – Marko Topolnik Feb 11 '14 at 14:13
  • This [link](https://wikis.oracle.com/display/HotSpotInternals/RangeCheckElimination) tends to suggest the check is eliminated by HotSpot only when "the array is indexed by linear functions of the index variable". – ochedru Feb 11 '14 at 14:36
  • @SpaceTrucker: I'm not sure, it's just my assumption. But the link by ochedru suggests it's true. – maaartinus Feb 11 '14 at 14:56
  • @MarkoTopolnik: That's pretty strange, could you post your code somewhere? Concerning "fetching the next value" mentioned above: I replaced `x += i` by `x += 1`, so that the access is sequential except for a single wrap around, but not much changes. I also tried to eliminate `x` and set `j = i & (table.length-1)`, which is equivalent to `j = i`, but seems to prevent the bound check elimination. – maaartinus Feb 11 '14 at 15:01
  • Here it is: http://pastebin.com/a2qpuV1i – Marko Topolnik Feb 11 '14 at 15:03
  • @MarkoTopolnik: `o.o.j.s.JmhBoundsCheckBenchmark.maskedIndex avgt 20 1148.463 (±2.447) ns/op`, `o.o.j.s.JmhBoundsCheckBenchmark.normalIndex avgt 20 1008.069 (±9.905) ns/op` – I just renamed and ran your benchmark and it confirms my results. – maaartinus Feb 11 '14 at 15:17
  • So we can conclude that the difference is down to a) JVM version/platform, b) differences in CPUs. Anyway, I wouldn't go much further chasing this, given that the difference is small to begin with, and unstable across configurations. – Marko Topolnik Feb 11 '14 at 15:19
  • Have you tried `x % (table.length-1)` instead of `x & (table.length-1)`? Maybe the compiler is not smart enough to figure out bounds of bitwise and at compile time. – SnakE Feb 11 '14 at 15:25
  • @MarkoTopolnik: Yes, once 10% faster and once 10% slower, this sounds rather boring. I guess I should update my Java first. – maaartinus Feb 11 '14 at 15:29
  • @SnakE: No, but with `%` you have no guarantee of getting a valid index, as it stays negative for negative `x`. – maaartinus Feb 11 '14 at 15:31
  • You would have to dump the JITC-generated code to see. And even a given JITC may give different results depending on subtle differences. – Hot Licks Feb 11 '14 at 17:11
  • @MarkoTopolnik: Didn't you confuse `ns/op` with `ops/ns` as I usually do? This is about the only explanation I can think of for bounds-checking code being faster. If not, would you mind trying my third experiment pair (a simplified version with `i==j`, just don't tell it to the JVM :D). – maaartinus Feb 11 '14 at 17:47
  • Definitely didn't confuse them. http://pastebin.com/NcC0JbwK and the results: `time3Masked 805.872 ns/op; time3Normal 945.730 ns/op`. – Marko Topolnik Feb 11 '14 at 18:03
  • Totally agree with the `%` suggestion – did you guarantee that the array length is a power of 2? Otherwise your `&` doesn't make much sense, and definitely isn't enough to make your accesses sequential. – Leeor Feb 11 '14 at 19:01
  • @Leeor: I totally disagree with `%`. It's such a slow operation that hardly anybody uses it for hash tables. In order to ensure a valid index, you'd need to add some branching and this would make the pattern hard for the compiler to recognize. Concerning `&` I need a power of two for it to *make sense*, but I don't need it for the *index to be valid* (the only thing I need is a non-empty array). – maaartinus Feb 11 '14 at 19:13
  • @maaartinus "hardly anybody uses `%` for hash tables" -- Trove does, I [do](https://github.com/OpenHFT/UntitledCollectionsProject). – leventov Feb 11 '14 at 19:21
  • @maaartinus, I didn't mean he should use it in the optimized version, just for debugging, to make sure the reason is not doing `&` with some length like `100000` – Leeor Feb 11 '14 at 19:30
  • @leventov: I hope you know what you're doing. And so do the Trove guys, but this may have historical reasons. Just note that I don't know about any use of it in Guava and AFAIK the only use in JDK is in `String.intern`. Division is dead slow with tens of cycles while you can do up to four simple instructions in a single cycle. – maaartinus Feb 11 '14 at 20:38
  • @leventov: See e.g. the [`WhitespaceMatcherBenchmark`](http://code.google.com/p/guava-libraries/source/browse/guava-tests/benchmark/com/google/common/base/WhitespaceMatcherBenchmark.java) and its [results](https://microbenchmarks.appspot.com/runs/8a7f91d8-408b-4619-ae06-9091bb86f01e) for how costly division is. – maaartinus Feb 11 '14 at 20:40
  • @Leeor: In my benchmark linked above the length is fixed, so nothing can go wrong. And as I said, with length being no power of two, the results must stay the same (for a length like 100000, they'd change due to cache misses). – maaartinus Feb 11 '14 at 20:43
  • @MarkoTopolnik: I guess, I wasn't clear: This wasn't about relating cache misses to power-of-two sizes but to array length. In the benchmark above, arrays of size 1024 or 1111 must perform the same (and for both there are no cache misses as they nicely fit). For a big array there'd be cache misses and prefetching would work for `i` but (probably) not for `j` (see also your first comment here). – maaartinus Feb 11 '14 at 21:10
  • They won't perform the same (except maybe if they fit in the L1 cache), because the HW prefetchers can handle sequential streams. If your array size is 1111, your `&` operation would get a complicated mask with some zeros masking off parts of your dataset, so you'll get both an irregular stream, and a subset of your entire dataset – Leeor Feb 11 '14 at 23:34
  • @Leeor: But in this benchmark this doesn't matter as everything fits in L1 (and the benchmarks starts with warm cache). AFAIK, when accessing L1, the pattern doesn't matter as each L1 access takes the same time. The pattern is very important when accessing other levels because of the [prefetchers](http://software.intel.com/en-us/blogs/2009/08/24/what-you-need-to-know-about-prefetching). – maaartinus Feb 12 '14 at 00:03
  • Agreed, I didn't know if that's a drilled down example – Leeor Feb 12 '14 at 06:01

3 Answers


To start off, the main difference between your two tests is definitely in bounds check elimination; however, the way this influences the machine code is far from what the naïve expectation would suggest.

My conjecture:

The bounds check figures more strongly as a loop exit point than as additional code which introduces overhead.

The loop exit point prevents the following optimization which I have culled from the emitted machine code:

  • the loop is unrolled (this is true in all cases);
  • additionally, the fetching-from-the-array stage is done first for all unrolled steps, then the xoring into the accumulator is done for all the steps.

If the loop can break out at any step, this staging would result in work performed for loop steps which were never actually taken.
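
At the Java level, the staging amounts to something like the following hand-unrolled sketch (illustrative only; the JIT does this on registers in machine code, and the remainder loop is omitted):

// Load all entries of an unrolled group first, then fold them in.
// A premature break after the first step would make the remaining
// loads work done for iterations that are never actually taken.
for (int i = 0; i + 3 <= table.length - 1; i += 4) {
    int e0 = table[i],     e1 = table[i + 1];   // stage 1: bulk loads
    int e2 = table[i + 2], e3 = table[i + 3];
    result ^= e0 ^ e1 ^ e2 ^ e3;                // stage 2: fold into accumulator
}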

Consider this slight modification of your code:

import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(Measure.N)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
@State(Scope.Thread)
@Threads(1)
@Fork(1)
public class Measure {
  public static final int N = 1024;

  private final int[] table = new int[N];
  @Setup public void setUp() {
    final Random random = new Random();
    for (int i = 0; i < table.length; ++i) {
      final int x = random.nextInt();
      table[i] = x == 0? 1 : x;
    }
  }
  @GenerateMicroBenchmark public int normalIndex() {
    int result = 0;
    final int[] table = this.table;
    int x = 0;
    for (int i = 0; i <= table.length - 1; ++i) {
      x += i;
      final int j = x & (table.length - 1);
      final int entry = table[i];
      result ^= entry + j;
      if (entry == 0) break;
    }
    return result;
  }
  @GenerateMicroBenchmark public int maskedIndex() {
    int result = 0;
    final int[] table = this.table;
    int x = 0;
    for (int i = 0; i <= table.length - 1; ++i) {
      x += i;
      final int j = x & (table.length - 1);
      final int entry = table[j];
      result ^= i + entry;
      if (entry == 0) break;
    }
    return result;
  }
}

There is just one difference: I have added the check

if (entry == 0) break;

to give the loop a way to exit prematurely on any step. (I also introduced a guard to ensure no array entries are actually 0.)

On my machine, this is the result:

Benchmark                   Mode   Samples         Mean   Mean error    Units
o.s.Measure.maskedIndex     avgt         5        1.378        0.229    ns/op
o.s.Measure.normalIndex     avgt         5        0.924        0.092    ns/op

the "normal index" variant is substantially faster, as generally expected.

However, let us remove the additional check:

// if (entry == 0) break;

Now my results are these:

Benchmark                   Mode   Samples         Mean   Mean error    Units
o.s.Measure.maskedIndex     avgt         5        1.130        0.065    ns/op
o.s.Measure.normalIndex     avgt         5        1.229        0.053    ns/op

"Masked index" responded predictably (reduced overhead), but "normal index" is suddenly much worse. This is apparently due to a bad fit between the additional optimization step and my specific CPU model.

My point:

The performance model at such a detailed level is very unstable and, as witnessed on my CPU, even erratic.

Marko Topolnik
  • You think that this "the fetching from the array stage is done first for all unrolled steps" is the culprit, right? Interesting! – maaartinus Feb 12 '14 at 17:29
  • Ideally, we should compare notes. The above technique isolates the effect of BCE from the effect of the additional staging optimization, so it would be interesting to see what it does on your side. – Marko Topolnik Feb 12 '14 at 18:03
  • Yes, we should. The comments here are rather unsuitable for this. I think it could be an interesting question; would you mind posting it? Otherwise, please send me an email to @gmail.com. – maaartinus Feb 12 '14 at 18:33
  • I have posted it as a question: http://stackoverflow.com/questions/21738690/strange-jit-pessimization-of-a-loop-idiom – Marko Topolnik Feb 12 '14 at 20:19
  1. No, this is evidently an effect of insufficiently smart bounds check elimination.

I've extended a benchmark by Marko Topolnik:

import java.lang.reflect.Field;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

import sun.misc.Unsafe;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(BCElimination.N)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
@State(Scope.Thread)
@Threads(1)
@Fork(2)
public class BCElimination {
    public static final int N = 1024;
    private static final Unsafe U;
    private static final long INT_BASE;
    private static final long INT_SCALE;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }

        INT_BASE = U.arrayBaseOffset(int[].class);
        INT_SCALE = U.arrayIndexScale(int[].class);
    }

    private final int[] table = new int[BCElimination.N];

    @Setup public void setUp() {
        final Random random = new Random();
        for (int i=0; i<table.length; ++i) table[i] = random.nextInt();
    }

    @GenerateMicroBenchmark public int normalIndex() {
        int result = 0;
        final int[] table = this.table;
        int x = 0;
        for (int i=0; i<=table.length-1; ++i) {
            x += i;
            final int j = x & (table.length-1);
            result ^= table[i] + j;
        }
        return result;
    }

    @GenerateMicroBenchmark public int maskedIndex() {
        int result = 0;
        final int[] table = this.table;
        int x = 0;
        for (int i=0; i<=table.length-1; ++i) {
            x += i;
            final int j = x & (table.length-1);
            result ^= i + table[j];
        }
        return result;
    }

    @GenerateMicroBenchmark public int maskedIndexUnsafe() {
        int result = 0;
        final int[] table = this.table;
        long x = 0;
        for (int i=0; i<=table.length-1; ++i) {
            x += i * INT_SCALE;
            final long j = x & ((table.length-1) * INT_SCALE);
            result ^= i + U.getInt(table, INT_BASE + j);
        }
        return result;
    }
}

Results:

Benchmark                                Mean   Mean error    Units
BCElimination.maskedIndex               1.235        0.004    ns/op
BCElimination.maskedIndexUnsafe         1.092        0.007    ns/op
BCElimination.normalIndex               1.071        0.008    ns/op


2. The second question is for the hotspot-dev mailing list rather than StackOverflow, IMHO.

leventov
  • Except that I checked the machine code and 1) `maskedIndex` had the bounds check where `normalIndex` didn't; 2) `normalIndex` used apparently streamlined code where the loop was both unrolled and reordered into two stages; 3) `maskedIndex` was nevertheless *faster* on my machine (by 8%). – Marko Topolnik Feb 11 '14 at 17:42
  • @MarkoTopolnik that is really strange, but more about peculiarities of your CPU's behaviour than the subject of this question – leventov Feb 11 '14 at 17:52
  • You mean, this question can be separated from the peculiarities of CPU behavior? That's a strange thought... anyway, these are my results: `maskedIndex 1.152 ns/op; maskedUnsafeIndex 1.116 ns/op; normalIndex 1.220 ns/op.` Normal indexing is still the slowest on my machine. – Marko Topolnik Feb 11 '14 at 17:53
  • Strange that I didn't think about using `Unsafe` to check my conjecture! – maaartinus Feb 12 '14 at 17:32
  • @maaartinus Note that introducing Unsafe doesn't really pinpoint the issue because different instructions are used to calculate the array offset (among other differences). Each just produces different machine code without a clear-cut way of comparing them. – Marko Topolnik Feb 12 '14 at 18:38
  • @MarkoTopolnik: That's true, but you can make it much more similar using `U.getInt(table, INT_BASE + j * INT_SCALE)` and this can be easily translated to exactly the same code. I haven't checked if it actually does. – maaartinus Feb 12 '14 at 19:50
  • @maaartinus `U.getInt(table, INT_BASE + j * INT_SCALE)` works *slower*, probably because the JIT compiles array indexing into a single compute-address-and-memory-op instruction, where the scale must be a very small immediate (max 3 bits in the instruction encoding), but when you write `* INT_SCALE`, it doesn't know that `INT_SCALE` is smaller than 8, and compiles this construction into a couple of instructions: `mul`, then offset-and-memory-op. Be careful. I didn't try `* 4L`, nor did I look into the assembly (I'm lazy), though. – leventov Feb 12 '14 at 20:11
  • @leventov The way you wrote it also ends up as a `mul`. – Marko Topolnik Feb 12 '14 at 20:18
  • My benchmark's results showed no difference after this change. As `INT_SCALE` is `private static final`, I hope the JIT can do whatever it wants. I'm too lazy too. The safe way would be to set it to `4` and throw in the initializer if the JVM is too exotic. @MarkoTopolnik: Yes, but it's independent of `x` and can be moved out of the loop. – maaartinus Feb 12 '14 at 20:51
  • In `x += i * INT_SCALE`, for each `i` there is a `mul` instruction. It can't be hoisted. Note however that `mul` by 4 is executed as shift left by 2, therefore it is very fast. – Marko Topolnik Feb 12 '14 at 20:53
  • @maaartinus Marko is right, my previous comment was removed due to obscene language. My assumption was that multiplying before masking leaves more flexibility for the CPU to arrange instructions on the arithmetic pipeline. – leventov Feb 12 '14 at 21:22

In order to safely eliminate that bounds check, it is necessary to prove that

h & (table.length - 1)

is guaranteed to produce a valid index into table. It won't if table.length is zero (as you'll end up with & -1, an effective no-op). It also won't usefully do it if table.length is not a power of 2 (you'll lose information; consider the case where table.length is 17).
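
For example, with table.length == 17 the mask is 16 (binary 10000), so the masked value collapses onto just two buckets, although it does stay in range:

// With a 17-entry table the mask is 16 = 0b10000: the result is always
// 0 or 16 -- still a valid index, but a useless distribution.
int length = 17;
for (int h : new int[] {3, 21, 48, 100}) {
    System.out.println(h + " -> " + (h & (length - 1)));
}
// prints: 3 -> 0, 21 -> 16, 48 -> 16, 100 -> 0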

How can the HotSpot compiler know that these bad conditions are not true? It has to be more conservative than a programmer can be, as the programmer can know more about the high-level constraints on the system (e.g., that the array is never empty and always has a number of elements that is a power of two).

Donal Fellows
  • I don't understand your comment about powers of 2. If `h` and `k` are nonnegative integers, then `h & k` is a nonnegative integer that is at most `h` and at most `k`. – ruakh Feb 12 '14 at 08:23
  • @ruakh That's technically not a safety condition, but it can produce terrible distributions. Consider the case with 17 buckets; you'll end up with everything going into either bucket 0 or (rarely) 16. The _only_ case where `h&(ary.length-1)` works well is when the array's size is a power of two (>=1), and the compiler isn't provided an easy proof of this. – Donal Fellows Feb 12 '14 at 10:17
  • I don't follow. If it's "technically not a safety condition", and the compiler's goal is merely "to safely eliminate that bounds check", then how is it relevant to the compiler? Why does the compiler need to be able to prove it? – ruakh Feb 12 '14 at 16:33
  • @ruakh: Agreed, the compiler shouldn't care. It can replace the bounds check by the `table.length > 0` check and leave the worries about the distribution to the programmer. – maaartinus Feb 12 '14 at 16:59