
I am working on a heavily CPU-bound problem, and I see a big performance improvement when I use the inline keyword. I create a dictionary from the standard .NET library, passing in a custom key comparer; see the code and timing results below.

https://gist.github.com/4409734

Without the inline keyword on Eq_cmp:

> perf_run 10000000 ;;
Real: 00:00:11.039, CPU: 00:00:11.029, GC gen0: 771, gen1: 3, gen2: 1
val it : unit = ()

Using the inline keyword on Eq_cmp:

> perf_run 10000000 ;;
Real: 00:00:01.319, CPU: 00:00:01.388, GC gen0: 1, gen1: 1, gen2: 1
val it : unit = ()

I also noticed the huge difference in the number of Gen 0 collections between the inlined and the non-inlined code.

Could someone explain why there is such a huge difference?

isaiah_p
  • You are surprised when using an optimization improves performance? This is completely expected behavior, although admittedly the size of the effect is comparatively large. – John Palmer Dec 29 '12 at 23:11
  • Generic equality tests in F# are slow. I think this is essentially the same problem as the one discussed here: http://stackoverflow.com/questions/6104221/why-is-this-f-code-so-slow/6104300#6104300 – Tomas Petricek Dec 30 '12 at 14:04

2 Answers


I can reproduce the behavior on my machine, with a 3x performance boost after adding the inline keyword.

Decompiling the two versions side by side under ILSpy gives almost identical C# code. The notable difference is in the two equality tests:

// Version without inline
bool IEqualityComparer<Program.Pair<a>>.System-Collections-Generic-IEqualityComparer(Program.Pair<a> x, Program.Pair<a> y)
{
    a v@ = x.v@;
    a v@2 = y.v@;
    if (LanguagePrimitives.HashCompare.GenericEqualityIntrinsic<a>(v@, v@2))
    {
        a w@ = x.w@;
        a w@2 = y.w@;
        return LanguagePrimitives.HashCompare.GenericEqualityIntrinsic<a>(w@, w@2);
    }
    return false;
}

// Version with inline
bool IEqualityComparer<Program.Pair<int>>.System-Collections-Generic-IEqualityComparer(Program.Pair<int> x, Program.Pair<int> y)
{
    int v@ = x.v@;
    int v@2 = y.v@;
    if (v@ == v@2)
    {
        int w@ = x.w@;
        int w@2 = y.w@;
        return w@ == w@2;
    }
    return false;
}

The generic equality is much less efficient than the specialized version.
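
As a minimal sketch (the function names below are made up for illustration, not the code from the gist), the difference boils down to whether (=) is resolved generically or at the call site:

// Hedged sketch: equalsGeneric / equalsInline are hypothetical names.
let equalsGeneric (x : 'T) (y : 'T) = x = y   // compiled against generic structural equality
let inline equalsInline x y = x = y           // (=) is resolved where the function is used

let r1 = equalsGeneric 1 2   // typically goes through the generic helper, boxing value-type arguments
let r2 = equalsInline 1 2    // can reduce to a direct int comparison after inlining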

I also noticed the huge difference in the number of Gen 0 collections between the inlined and the non-inlined code.

Could someone explain why there is such a huge difference?

Taking a look at the GenericEqualityIntrinsic function in the F# source code:

let rec GenericEqualityIntrinsic (x : 'T) (y : 'T) : bool = 
    fsEqualityComparer.Equals((box x), (box y))

It boxes its arguments, which explains the significant amount of garbage in your first example. When the GC kicks in too often, it slows down the computation dramatically. The second example (using inline) produces almost no garbage when Pair is a struct.

That said, this is the expected behavior of the inline keyword when a specialized version is used at the call site. My suggestion is to always optimize and measure your code on the same benchmarks.
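
For example, one way to see both the elapsed time and the gen-0 collection count around a run (a sketch only; perf_run comes from the question, and the measure helper below is hypothetical):

// Hypothetical measurement helper, not part of the original gist.
let measure (f : unit -> unit) =
    let gen0Before = System.GC.CollectionCount(0)
    let sw = System.Diagnostics.Stopwatch.StartNew()
    f ()
    sw.Stop()
    printfn "Elapsed: %O, gen0 collections: %d" sw.Elapsed (System.GC.CollectionCount(0) - gen0Before)

// measure (fun () -> perf_run 10000000)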

You may be interested in a very similar thread: Why is this F# code so slow?

pad

Type specialization

Without inline, you are using generic comparison, which is very inefficient. With inline, the genericity is removed and int comparison is used directly.
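
For illustration, the same effect can be had by writing the comparer directly against the concrete key type, so that no generic equality is involved at all. This is only a sketch under assumptions (PairInt, intPairComparer, and the hash function below are guesses, not the code from the gist):

open System.Collections.Generic

// Assumed shape: a struct pair of ints, compared field by field.
[<Struct>]
type PairInt = { v : int; w : int }

// A comparer written against the concrete type; (=) here is plain int equality.
let intPairComparer =
    { new IEqualityComparer<PairInt> with
        member __.Equals(x, y) = x.v = y.v && x.w = y.w
        member __.GetHashCode(p) = (p.v * 397) ^^^ p.w }

let d = Dictionary<PairInt, int>(intPairComparer)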

J D
  • Interested to know how OCaml handles this, as there is no equivalent to inline AFAIK – isaiah_p Jan 04 '13 at 15:41
  • @user125 OCaml handles this very badly indeed. Every value that requires at least 1 word to store gets boxed by default (there is a special case to unbox arrays of floats, but that incurs run-time type tests on all arrays). Every generic function does polymorphism via run-time dispatch, which is slow. Inlining is naive at best (small leaf functions get inlined) and is also impeded by functor boundaries. So OCaml's Hashtbl.t is an array of (heap-allocated) lists of (tagged) keys and values. Custom comparison and hashing means functors, which means broken inlining. – J D Jan 04 '13 at 16:58
  • And finally, your program represents pathological behaviour for a generational garbage collector like the one in OCaml. Overall, I'd expect this to be well over 10x faster in F# than in OCaml. – J D Jan 04 '13 at 17:01
  • Thanks, what is a good resource to get low-level details on the OCaml runtime? – isaiah_p Jan 05 '13 at 19:08
  • FWIW, on this machine (4x 3.4GHz Intel Core i7-3770), filling an empty hash table with 10,000,000 int→int bindings is 12x faster in F# than OCaml, and with float→float bindings is 23x faster in F# than OCaml. – J D Jan 07 '13 at 22:53