I have these basic types in my code:
```haskell
newtype Foo (m :: Factored) = Foo Int64 deriving (NFData)

foo :: forall m . (Fact m) => Foo m -> Foo m

class T t where
  t :: (Fact m) => t m -> t m

instance T Foo where t = foo

newtype Bar t (m :: Factored) = Bar (t m) deriving (NFData)

bar :: (Fact m, T t) => Bar t m -> Bar t m
bar (Bar v) = Bar $ t v
```
(Ignore `Fact` and `Factored` for the moment.) I'm benchmarking the code at different levels, comparing the performance of `foo`, `t`, and `bar`. In the benchmarks, `t = foo`, and `bar` just applies `t` through a newtype, so their runtimes should be essentially identical. But criterion reports that `foo` takes 9.2 ns, `t` takes about twice that at 17.45 ns, and `bar` takes a whopping 268.1 ns.
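For reference, the three levels being measured can be sketched self-containedly. This is only an approximation: the singletons-based `Fact`/`Factored` machinery is replaced by a hypothetical plain class with a phantom tag `F128` and a made-up method `factVal`, and the real timing harness (criterion's `whnf`) is omitted:

```haskell
{-# LANGUAGE ScopedTypeVariables #-}

import Data.Int (Int64)
import Data.Proxy (Proxy (..))

-- Hypothetical stand-ins for the singletons-based Fact/Factored machinery.
class Fact m where factVal :: Proxy m -> Int64
data F128
instance Fact F128 where factVal _ = 128

newtype Foo m = Foo Int64 deriving (Eq, Show)

-- Level 1: the function called directly.
foo :: forall m. Fact m => Foo m -> Foo m
foo (Foo x) = Foo (x + factVal (Proxy :: Proxy m))

-- Level 2: the same operation reached through a class method (t = foo).
class T t where t :: Fact m => t m -> t m
instance T Foo where t = foo

newtype Bar t m = Bar (t m) deriving (Eq, Show)

-- Level 3: the class method applied through a newtype wrapper.
bar :: (Fact m, T t) => Bar t m -> Bar t m
bar (Bar v) = Bar (t v)
```

All three levels compute the same result, so any gap in runtime comes purely from dictionary passing and (non-)inlining, not from the work being done.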
I've experimented with adding `INLINABLE` and even a `SPECIALIZE` pragma, but they aren't helping. I want to believe that GHC has some magic syntax/optimization/etc. that can be applied consistently to solve these kinds of performance issues. For example, I've seen cases where writing code in point-free style has dramatic performance implications.
The complete code can be found here. I promise it's not intimidating. The modules are:
- `Foo`: defines `Foo`, `foo`, and `T`
- `Bar`: defines `Bar` and `bar`
- `FooBench`: defines a benchmark for `foo`
- `TBench`: defines a benchmark for `t`
- `BarBench`: defines a benchmark for `bar`
- `Main`: runs the three benchmarks
- `Factored`: defines `Fact` and `Factored` using singletons
Most of the modules are tiny; I defined the three benchmarks in separate files so that I could examine their core. I generated the core for the three `*Bench` modules and aligned them as best I could. They're only ~250 lines each, and the first ~200 lines are identical up to renaming. The problem is that I don't know what to make of the last 50 or so lines. The (core) diff for `FooBench` vs. `TBench` is here, the diff for `TBench` vs. `BarBench` is here, and the diff for `FooBench` vs. `BarBench` is here.
Just a few of the questions I have:
- At a high level, what is the essential difference between the core files? I'm looking for something like "Here you can see that GHC isn't inlining the call to `x`." What should I be looking for?
- What can be done to make all three benchmarks run in about 9.2 ns? GHC optimizations? `INLINE`/`INLINABLE` pragmas? `SPECIALIZE` pragmas that I missed? (You aren't allowed to specialize for `F128 :: Factored`; in my real library this value may be reified at runtime.) Restricting/delaying inlining to a particular phase?
Although I'm looking for an actual solution that makes all the benchmarks fast, tricks that work for this example may not scale to my real library. As a result, I'm also looking for a high-level explanation of why a particular technique should work.