
One core skill in programming is code re-use (DRY). However, since a function typically¹ is compiled only once, calling the same function with slightly differently shaped data leads to polymorphism, even when the data never mixes within a single call - the shapes are merely partitioned across the call-sites. This is detrimental to inline caches and other optimizations.

const f = a => {
  let diagonal = 0;
  for (let i = 0; i < 256; i += 16 + 1) diagonal += a[i].x;
  return diagonal;
};
// One place in code:
const a1 = Array.from({ length: 256 }, () => ({ x: 0 }));
f(a1);
// Different place, almost unrelated:
const a2 = Array.from({ length: 256 }, () => ({ x: 0, hiddenClassChange: 0 }));
f(a2);
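One mitigation I considered: project the data onto a single canonical shape at the call-site boundary, so the shared helper only ever sees one hidden class. A sketch (`normalize` is a made-up helper name, and the copy it makes may of course cost more than the polymorphism it avoids):

```javascript
// Normalize every input to the canonical shape { x } before the shared
// helper sees it, so all call-sites feed f the same hidden class.
const normalize = (arr) => arr.map((o) => ({ x: o.x }));

// Same helper as above, repeated for self-containment.
const f = (a) => {
  let diagonal = 0;
  for (let i = 0; i < 256; i += 16 + 1) diagonal += a[i].x;
  return diagonal;
};

const a2 = Array.from({ length: 256 }, () => ({ x: 1, hiddenClassChange: 0 }));
console.log(f(normalize(a2))); // 16 (16 sampled elements, each with x = 1)
```

Whether this pays off depends on how hot the loop is relative to the allocation cost of the normalized copies.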

Since this pattern is very common, it also shows up in bottlenecks. Is there something V8 already does to solve this, is this a misunderstanding on my side, or is there a way to solve it other than ctrl+c ctrl+v on the textual representation of the function in the code base?


In the scenario that kicked off this question, two places have different ways of addressing (x, y) => data, while the iteration stays the same. I'd like to not copy&paste the iteration code, but if that means the addressing schemes can't be inlined, potentially leading to even more missed optimizations (escape analysis, etc.), I might have to.
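If duplication really turns out to be the only option, it would not have to be textual copy&paste in the code base: the copies could be manufactured at runtime from a single source string, so that each call-site gets its own function object (and thus its own type feedback). A sketch with made-up names; whether this actually helps would have to be measured per engine version, and `new Function` is unavailable under a strict CSP:

```javascript
// One authoritative copy of the iteration logic as source text; each
// call-site gets an independently compiled copy so their type feedback
// never mixes. Hypothetical workaround - benchmark before adopting.
const diagonalSource = `
  let diagonal = 0;
  for (let i = 0; i < 256; i += 16 + 1) diagonal += a[i].x;
  return diagonal;
`;
const makeDiagonal = () => new Function("a", diagonalSource);

const diagonalForSiteA = makeDiagonal(); // used at one place in the code
const diagonalForSiteB = makeDiagonal(); // used at the other place

const a1 = Array.from({ length: 256 }, () => ({ x: 2 }));
console.log(diagonalForSiteA(a1)); // 32 (16 sampled elements, each with x = 2)
```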


[1]: When a function is inlined, can the code embedded into the outer function have a different inline cache than at another call-site? If that were the case, and it happened in a predictable, reliable way, this could solve the problem.


As requested, here is an example. Note that this way of benchmarking isn't generally applicable; it should only serve to show that there is a difference. The real case is too large to post here.

const f = a => {
  let diagonal = 0;
  for (let i = 0; i < 256; i += 16 + 1) diagonal += a[i].x;
  return diagonal;
};
// Copy of `f`.
const g = a => {
  let diagonal = 0;
  for (let i = 0; i < 256; i += 16 + 1) diagonal += a[i].x;
  return diagonal;
};
// Could write this more compact, but this is faster to understand.
const a1 = Array.from({ length: 256 }, () => ({ x: 0, a: 0 }));
const a2 = Array.from({ length: 256 }, () => ({ x: 0, b: 0 }));
const a3 = Array.from({ length: 256 }, () => ({ x: 0, c: 0 }));
const a4 = Array.from({ length: 256 }, () => ({ x: 0, d: 0 }));
const a5 = Array.from({ length: 256 }, () => ({ x: 0, e: 0 }));
const a6 = Array.from({ length: 256 }, () => ({ x: 0, f: 0 }));
const a = [a1, a2, a3, a4, a5, a6];
// Prime `f` and `g`.
for (let i = 0; i < 100000; i++) {
  for (const an of a) f(an);
  g(a1);
}
// Very naive time measure.
let start = performance.now();
for (let i = 0; i < 100000; i++) f(a1);
console.log(performance.now() - start);
start = performance.now();
for (let i = 0; i < 100000; i++) g(a1);
console.log(performance.now() - start);
Doofus
  • Why would you ever have to worry about something that specific? Do you have any proof of this being the specific cause of a performance problem? – AspectOfTheNoob Jan 09 '23 at 19:07
  • can you approach this from a functional programing angle, rather than an object oriented angle? – This Guy Jan 09 '23 at 19:36
  • The underlying problem is the same as in the linked question for copying a function. That it can be a difference is visible there, and not surprising. However, that question overly focuses on one possible solution. In a way, it shows an XY-problem. I more and more get the feeling there won't be a nice solution, but have been surprised before - better ask a question, than follow a wrong belief. – Doofus Jan 09 '23 at 23:24
  • 'Is there something v8 does' - v8 is a JIT compiler aiming to run JavaScript fast. As JavaScript is dynamic in nature, the whole thing is 'doing something to make polymorphism fast' ... There is a whole zoo of things going on: inline caches (plural), speculative optimization, deoptimization, on-stack replacement, ... – Jonas Wilms Jan 10 '23 at 23:19
  • The assumption that 'copying code to avoid polymorphism increases performance' is a very bold one - after all that code has to be parsed, compiled and optimized twice. One just cannot make such general assumptions on a system that is largely based on heuristics, one can only look at what happens in a concrete example - which unfortunately was not part of the question. – Jonas Wilms Jan 10 '23 at 23:23
  • Maybe relevant: https://mrale.ph/blog/2015/01/11/whats-up-with-monomorphism.html – Jonas Wilms Jan 10 '23 at 23:38
  • I've added an example, but it only writes out what has already been said. As mentioned, similar examples exist on the linked question. The point of interest is whether there is a way to prevent this, other than copy&paste. Again, I assume there isn't, but before manually duplicating code, I'd rather ask. – Doofus Jan 11 '23 at 09:20
  • Towards some points, "inline caches" are the primary reason for this. A related issue is not being able to inline, which however imho traces back to having too many different functions as input, and is therefore related to overwhelmed inline caches. "Speculative optimization" is a very broad term, and surely related, but not specific. "Deoptimization" shouldn't be the issue here. OSR is afaik also not related, unless you care about the loops in the benchmark itself. Adding it to a function with `%NeverOptimizeFunction` doesn't change the result as far as I can see. – Doofus Jan 11 '23 at 09:27
  • PS: I don't know what this is called, or what to search for. If I have to use some name for it, "partitioned polymorphism" would be a guess. It happens every time two otherwise disjoint problems use the same utility function (code deduplication, DRY), and should therefore be very broadly applicable. As always with optimization, it's only important for bottlenecks (where, as common as it is, it also often appears). Perhaps I am mistaken, as there could be something I missed - given how common this is, I'd assume it was/is a topic of interest for v8. I'd be happy if my worries were unreasonable. – Doofus Jan 11 '23 at 09:38

2 Answers


The example you have picked is a pretty extreme one, and indeed it shows a significant difference in execution times. Looking at g, one can see that V8 compiles the inner loop once to:

; for (...)
for_loop:
  movq r8,r12

  ; i < 256 (i is stored in register r11)
  cmpl r11,0x100
  jnc exit_loop
  movl r12,r11

  ; ensure a[i] is in bounds
  cmpl r11,rdi
  jnc index_out_of_bounds

  ; dereference a[i]
  movq r12,[rcx+r12*8+0xf]

  ; check that a[i] is not empty
  cmpq [r13+0x28] (root (the_hole_value)),r12
  jz hole_hit

  ; is it an smi (an int32)?
  testb r12,0x1
  jz no_smi

  ; a[i] is the shape ('map') we expect
  cmpq [r12-0x1],r9
  jnz wrong_map

  ; .x
  movl r12,[r12+0x1b]
  ; the addition
  addl r12,r8

  ; integer overflow?
  jo int_overflow
  
  ; i++
  addl r11,0x11

  ; StackGuard / For loop end
  cmpq rsp,[r13-0x20] (external value (StackGuard::address_of_jslimit()))
  ja for_loop
  jmp exit_guard

So this is pretty much as good as a loop in JavaScript can be compiled; the lookup of x is simple pointer arithmetic. You have basically created a best-case example for an optimizing compiler. When we look at the other function, f, the compiler first makes the same assumption, then has to deoptimize, and then optimizes again into a more generic variant. In the end we get this in the inner loop:

movq rcx,0x1a18924f3ff9    ;; object: 0x1a18924f3ff9 <String[#1]: x>
movl rax,0x8
movq rdx,r9
movq rsi,[rbp-0x18]
movq r9,rcx
movq r11,rax
movq r10,0x1412ce0  (LoadICTrampoline_Megamorphic)
call r10

So here, in every iteration of the loop, the property is looked up dynamically from the object. Still, the runtime difference for me is ~16ms vs. ~6ms; although that is more than twice as slow, it amounts to only ~10ms over 100000 iterations, so in real-world use cases this difference often won't matter. Now I've made a small change to your code, using only four different object shapes:

const a = [a1, a2, a3, a4, /* a5, a6 */];

And suddenly the runtime drops to ~6ms for f as well, although it is called with four different object shapes. Compared to g, f only has this additional code:

  ; r14 is the object
  movq r14,[r12-0x1]
  cmpq r11,r14
  jz do_addition
  movq r15,0x5765057f991    ;; object: 0x05765057f991 <Map(HOLEY_ELEMENTS)>
  cmpq r15,r14
  jz do_addition
  movq r15,0x5765057f9d9    ;; object: 0x05765057f9d9 <Map(HOLEY_ELEMENTS)>
  cmpq r15,r14
  jz do_addition
  movq r15,0x5765057f901    ;; object: 0x05765057f901 <Map(HOLEY_ELEMENTS)>
  cmpq r15,r14
  jnz unknown_shape
do_addition:
  movl r12,[r12+0x1b]
  addl r12,r8

What we can see here is that the objects in a1 to a4 have four different shapes, but the property .x sits at the same offset +0x1b in all of them. The only 'overhead' is that instead of checking for exactly one object shape, the code checks that it is one of the four shapes.

So we can see that V8 deals with polymorphism (up to a degree of 4) quite well, but at some point, dealing with more and more shapes hits the point where the compile-time cost of optimizing for and distinguishing all those shapes is no longer worth the runtime savings.

In real-world code I would assume that the megamorphic scenario usually occurs in code that is not run very often (and is thus probably just interpreted, not even optimized) and rarely happens inside a tight loop. When it does happen inside a tight loop (i.e. you iterate thousands or millions of times over an array), I'd hope that on-stack replacement kicks in and makes the loop fast again. So what you have created is pretty much the worst-case scenario for V8 (a tiny loop inside a large loop within a function call, with more than four shapes), and even there the performance impact is not that bad.

So all in all, I'd say that V8 deals with polymorphism pretty well, and you really have to construct very artificial edge-case scenarios to run into megamorphic problems.

I can only recommend reading What's up with monomorphism and I hope that someone can confirm my observations - as I have no idea what I'm talking about here ;)

Jonas Wilms

As a quick addition to @JonasWilms' answer:

Yes, this effect is a thing; no, there's no general way to avoid it, neither in the engine nor in user code. Flexible/dynamic/polymorphic code comes at a cost.

We've spent quite some time thinking about ways to improve there, but for several reasons it's hard to come up with a plan that's actually better across the board:

  • the differences aren't that big in practice
  • supporting every possible pattern is practically impossible, so it always boils down to drawing the line where the fast path gives up somewhere
  • the fact that JS is so dynamic (e.g.: functions can be overwritten at any time, prototypes and prototype chains can be modified at any time) makes it really difficult to get speedups through caching. "Difficult" here doesn't mean "engine developers are too lazy to do it", rather it means "the amount of checks you need to make sure the engine still behaves correctly when JS code does something weird is so large that these checks will eat the performance benefits that the fast path might otherwise have provided".
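To illustrate that last point with a deliberately trivial sketch (not engine internals): any cached assumption about where a property or method lives can be invalidated by ordinary JavaScript at any time.

```javascript
class Point {
  constructor(x) { this.x = x; }
  getX() { return this.x; }
}
const p = new Point(1);
console.log(p.getX()); // 1 - an engine may cache where getX and .x live

// Perfectly legal JavaScript: replace the method on the prototype later.
Point.prototype.getX = function () { return this.x * 10; };
console.log(p.getX()); // 10 - every cached fast path must account for this
```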

What makes us particularly sad is that you can run into a very similar issue with inheritance:

class Base {
  constructor(a) { this.a = a; }
  f() {
    let diagonal = 0;
    for (let i = 0; i < 256; i += 16 + 1) {
      diagonal += this.a[i];  // With enough subclasses, "this.a" becomes megamorphic.
    }
    return diagonal;
  };
}
class Sub1 extends Base { ... }
class Sub2 extends Base { ... }
class Sub2_1 extends Sub2 { ... }
// etc...

(new Sub1(...)).f();
(new Sub2(...)).f();
(new Sub2_1(...)).f();
// etc...

Depending on your situation, sometimes for particularly hot code this can be worked around by designing an object model that keeps some common parts monomorphic. For example, instead of deriving from a base class and adding this.* fields in subclasses, you add a single this.kind_specific_data field to the base class, and sub-types of objects use that to store a single object containing their specific data. Taking an AST to illustrate, instead of:

class Node {
  constructor() { this.id = GetNextNodeId(); }
}
class BinaryOperation extends Node {
  constructor(left, right) { super(); this.left = left; this.right = right; }
}

you'd do:

class Node {
  constructor(data = null) { this.id = GetNextNodeId(); this.data = data; }
}
function MakeBinaryOperation(left, right) {
  return new Node({left, right});
}

and then algorithms iterating over your AST that only need to access Node IDs (and other common-to-all-nodes infrastructure) can be very fast because all accesses are monomorphic. You'll probably want to add some custom scheme to distinguish the various sub-types, maybe a .type property.
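A sketch of what such a `.type` discriminator could look like (all names here are made up for illustration):

```javascript
class Node {
  constructor(type, data = null) {
    this.id = Node.nextId++; // common-to-all-nodes infrastructure
    this.type = type;        // discriminator replacing subclassing
    this.data = data;        // sub-type-specific payload behind one field
  }
}
Node.nextId = 0;

const makeLiteral = (value) => new Node("literal", { value });
const makeBinaryOperation = (left, right) => new Node("binop", { left, right });

// Every node shares one hidden class, so this walk stays monomorphic
// even though the .data payloads have different shapes.
const nodes = [makeLiteral(1), makeLiteral(2), makeBinaryOperation(0, 1)];
const ids = nodes.map((n) => n.id);                                // [0, 1, 2]
const binopCount = nodes.filter((n) => n.type === "binop").length; // 1
```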

This pattern isn't always applicable, and can be more verbose than inheritance based designs. On the bright side, it can have unrelated benefits, such as improving modularity, testability, and reusability of your code. The performance differences can also be quite small, so in the majority of cases, there's no need to worry about it.

jmrk
  • I must say I'm a bit surprised that access of base class properties is not always monomorphic - do these objects not all share a common hidden class that has `.a` in the first slot? – Bergi Jan 12 '23 at 16:17
  • Thanks yet again, and sorry for using your time so often, but I want to get the details right on these topics. Towards Bergi's question, my chance to test my understanding: they share a common hidden-class ancestor. However, each subclass adding different instance properties (or in different order) results in a different hidden-class. Each subclass calling the method adds a case to the IC, one equality comparison, not for everything up to the root. A [previous question](https://stackoverflow.com/questions/74619246/polymorphism-overwhelming-inline-caches) is somewhat related. – Doofus Jan 12 '23 at 17:37
  • Ha, yeah, I had forgotten about that other question. -- `s1 = new Sub1()` and `s2 = new Sub2()` will have different hidden classes. They'll each have `.a` as their first property (like the array elements in the OP), and they'll have a common indirect `Base` prototype somewhere up their prototype chain, but the objects themselves do have different hidden classes. – jmrk Jan 12 '23 at 18:56