
While this may sound like a theoretical question, suppose I decide to invest in and build a mission-critical application written in Haskell. A year later I find that I absolutely need to improve the performance of some very thin bottleneck, and this will require optimizing memory access close to raw machine capabilities.

Some assumptions:

  • It isn't a realtime system - occasional latency spikes are tolerable (from interrupts, thread-scheduling irregularities, occasional GC, etc.)
  • It isn't a numeric problem - data layout and cache-friendly access patterns matter most (avoiding pointer chasing, reducing conditional jumps, etc.)
  • Code may be tied to a specific GHC release (but no forking)
  • The performance goal requires in-place modification of pre-allocated off-heap arrays, taking alignment into account (C strings, bit-packed fields, etc.)
  • Data is statically bounded in arrays, and allocations are rarely if ever needed

What mechanisms does GHC offer to perform this kind of optimization reliably? By "reliably" I mean that if a source change causes the code to no longer perform, it is correctable in source code without rewriting it in assembly.

  • Is it already possible using GHC-specific extensions and libraries?
  • Would a custom FFI help avoid C calling convention overhead?
  • Could a special-purpose compiler plugin do it through a restricted source DSL?
  • Could a source-code generator from a "high-level" assembly (LLVM?) be a solution?
sevo
    So you looked at the current machine code at the profiling hotspot, and found some missed optimizations? Did you try modifying the compiler-generated asm to check that what you think would be faster actually does run faster? Or at least have some static analysis to justify your conclusion? (And HW performance counter data to confirm your conclusion about why it's slow in the first place?) Yes it's often possible to beat the compiler *if* you know what you're doing (See my [Collatz conjecture asm optimization answer](https://stackoverflow.com/q/40354978)), but always measure. – Peter Cordes Aug 30 '18 at 00:22
  • (And no, I don't know how to use GHC. I'm here for the `[micro-optimization]` tag, not Haskell, sorry.) – Peter Cordes Aug 30 '18 at 00:22
  • "Did you try modifying the compiler-generated asm to check that what you think would be faster actually does run faster?" => Roger that! This question is about a middle step towards that goal! I need to convince GHC to generate code that maps better to what I know it "should" be doing, and with less noise. – sevo Aug 30 '18 at 20:59

1 Answer


It sounds like you're looking for unboxed arrays. "Unboxed" in haskell-land means "has no runtime heap representation". You can usually learn whether some part of your code is compiled to an unboxed loop (a loop that performs no allocation) by looking at the Core representation (a very haskell-like intermediate language; it's the first stage in compilation). So e.g. you might see `Int#` in the Core output, which means an integer with no heap representation (it's going to live in a register).
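As a minimal sketch (the function name is mine), a strict accumulator loop like this typically compiles down to exactly such an unboxed loop; compiling with `-O2 -ddump-simpl` lets you see the `Int#`s in the Core output:

```haskell
{-# LANGUAGE BangPatterns #-}
module SumTo where

-- With -O2, GHC unboxes this loop: in the Core output both `i` and
-- `acc` show up as Int#, and the loop itself performs no allocation.
sumTo :: Int -> Int
sumTo n = go 0 0
  where
    go !i !acc
      | i > n     = acc
      | otherwise = go (i + 1) (acc + i)
```

Removing the bang patterns (or making the accumulator lazy some other way) is exactly the kind of source change that can reintroduce boxing, which is why inspecting Core matters.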

When optimizing haskell code we regularly look at Core and expect to be able to correct for performance regressions by changing the source code (e.g. adding a strictness annotation, or fiddling with a function so that it can be inlined). This isn't always fun, but it will be fairly stable, especially if you are pinning your compiler version.

Back to unboxed arrays: GHC exposes a lot of low-level primops in `GHC.Prim`; in particular, it sounds like you want mutable unboxed arrays (`MutableByteArray#`). The primitive package exposes these primops behind a slightly safer, friendlier API, and is what you should use (and depend on, if writing your own library).
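A small sketch using primitive (the `demo` name is mine): `newAlignedPinnedByteArray` gives you a pinned buffer with a chosen alignment, which you then mutate in place, matching your alignment and pre-allocation requirements:

```haskell
module AlignedBuf where

import Data.Primitive.ByteArray
import Data.Word (Word8)

-- Allocate a 64-byte buffer aligned to 64 bytes (one cache line on
-- common x86 hardware), fill it, and poke a single byte in place.
-- The buffer is pinned, so the GC never moves it.
demo :: IO Word8
demo = do
  buf <- newAlignedPinnedByteArray 64 64
  setByteArray buf 0 64 (0 :: Word8)    -- memset-style fill, in place
  writeByteArray buf 7 (0xFF :: Word8)  -- single-byte in-place update
  readByteArray buf 7
```

All three mutation operations here run without allocating; only the initial `newAlignedPinnedByteArray` touches the allocator.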

There are many other libraries that implement unboxed arrays, such as vector, which are built on `MutableByteArray`; the point is that operations on that structure generate no garbage and likely compile down to pretty predictable machine instructions.
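For illustration (the `squares` helper is mine), the same in-place style through vector's unboxed mutable API looks like this; the elements live contiguously in a single `ByteArray` with no per-element boxing:

```haskell
module VecDemo where

import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as M

-- Pre-allocate an unboxed vector of n Ints and fill it in place;
-- U.create runs the mutation in ST and freezes the result safely.
squares :: Int -> U.Vector Int
squares n = U.create $ do
  v <- M.new n
  let go i
        | i >= n    = pure v
        | otherwise = M.write v i (i * i) >> go (i + 1)
  go 0
```

The `U.create` pattern keeps mutation local while exposing an immutable result, which tends to survive refactoring better than hand-threaded `unsafeFreeze` calls.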

You might also like to check out this technique if you're doing numeric work and want to use a particular instruction or implement some loop directly in assembly.

GHC also has a very powerful FFI, and you can research how to write portions of your program in C and interoperate; haskell supports pinned arrays, among other structures, for this purpose.
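On the calling-convention question: marking a foreign import `unsafe` skips the safe-call bookkeeping, so the call is close to a plain C function call (appropriate only for short, non-blocking C functions). A sketch (the `fillDemo` helper is mine) that mutates an off-heap buffer via C's `memset`:

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module FFIDemo where

import Data.Word (Word8)
import Foreign.C.Types (CInt (..), CSize (..))
import Foreign.Marshal.Alloc (free, mallocBytes)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekByteOff)

-- "unsafe" here means: no safepoint machinery around the call, so the
-- overhead is minimal -- but the C function must not block or call
-- back into Haskell.
foreign import ccall unsafe "string.h memset"
  c_memset :: Ptr Word8 -> CInt -> CSize -> IO (Ptr Word8)

fillDemo :: IO Word8
fillDemo = do
  p <- mallocBytes 16 :: IO (Ptr Word8)  -- off-heap, invisible to the GC
  _ <- c_memset p 0x2A 16                -- fill all 16 bytes with 42
  b <- peekByteOff p 3 :: IO Word8
  free p
  pure b
```

The same `Ptr` could instead come from a pinned `ByteArray`, keeping the allocation on the Haskell side while still handing raw memory to C.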

If you need more control than those give you, then haskell is likely the wrong language. It's impossible to tell from your description whether this is the case for your problem (your requirements seem contradictory: you need to be able to write a carefully cache-tuned algorithm, but arbitrary GC pauses are okay?).

One last note: you can't rely on GHC's native code generator to perform any of the low-level strength-reduction optimizations that e.g. GCC performs (GHC's NCG will probably never know about bit-twiddling hacks, autovectorization, etc.). Instead you can try the LLVM backend, but whether you see a speedup in your program is by no means guaranteed.
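Trying both backends is just a flag change (assuming a compatible LLVM toolchain is on your PATH); benchmark both builds, since which one wins varies by program:

```shell
# Native code generator (the default backend)
ghc -O2 -fforce-recomp Main.hs -o main-ncg

# LLVM backend; requires opt/llc from a GHC-supported LLVM version
ghc -O2 -fllvm -fforce-recomp Main.hs -o main-llvm
```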

jberryman
  • That code sample looks very promising! I didn't know about pinned arrays! Many systems can live with *occasional* GC pauses when throughput is the primary measure, or when requests are idempotent and can be duplicated. I expect that keeping the heap small can bring pauses down to acceptable levels. – sevo Aug 31 '18 at 21:31