5

How do I optimize numerical integration routine (comparing to C)?

What has been done to the moment:

  1. I replaced lists with unboxed vectors (obvious).
  2. I applied profiling techniques described in the book "Read World Haskell" http://book.realworldhaskell.org/read/profiling-and-optimization.html. I have inlined some trivial functions and inserted a lot of bangs everywhere. That gave about 10x speedup.
  3. I refactored the code (i.e. extracted iterator function). That gave 3x speedup.
  4. I tried to replace polymorphic signatures with Floats as in the answer to this question Optimizing numerical array performance in Haskell. That gave almost 2x speedup.
  5. I compile like this cabal exec ghc -- Simul.hs -O2 -fforce-recomp -fllvm -Wall
  6. UPDATE As suggested by cchalmers, type Sample = (F, F) was replaced with data Sample = Sample {-# UNPACK #-} !F {-# UNPACK #-} !F

The performance now is almost as good as C code. Can we do better?

{-# LANGUAGE BangPatterns #-}

module Main
  where

import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as UM
import qualified Control.Monad.Primitive as PrimitiveM

import Dynamics.Nonlin ( birefrP )

type F = Float
type Delay = U.Vector F
type Input = U.Vector F
-- Sample can be a vector of any length (x, y, z, ...)
data Sample = Sample {-# UNPACK #-} !F {-# UNPACK #-} !F
-- Pair is used to define exactly a pair of values
data Pair = Pair {-# UNPACK #-} !F {-# UNPACK #-} !F

type ParametrizedDelayFunction = (Sample, F) -> Sample

getX :: Sample -> F
getX (Sample a _) = a
{-# INLINE getX #-}

toDelay :: [F] -> Delay
toDelay = U.fromList

stepsPerNode :: Int
stepsPerNode = 40  -- Number of integration steps per node

infixl 6 ..+..
(..+..) :: Sample -> Sample -> Sample
(..+..) (Sample x1 y1) (Sample x2 y2) = Sample (x1 + x2) (y1 + y2)
{-# INLINE (..+..) #-}

infixl 7 .*..
(.*..) :: F -> Sample -> Sample
(.*..) c (Sample x2 y2) = Sample (c * x2) (c * y2)
{-# INLINE (.*..) #-}

-- | Ikeda model (dynamical system, DDE)
ikeda_model2
  :: (F -> F) -> (Sample, F) -> Sample
ikeda_model2 f (!(Sample x y), !x_h) = Sample x' y'
  where
    ! x' = recip_epsilon * (-x + (f x_h))
    y' = 0
    recip_epsilon = 2^(6 :: Int)

-- | Integrate using improved Euler's method (fixed step).
--
-- hOver2 is already half of step size h
-- f is the function to integrate
-- x_i is current argument (x and y)
-- x_h is historical (delayed) value
-- x_h2 it the value after x_h
heun2 :: F -> ParametrizedDelayFunction
  -> Sample -> Pair -> Sample
heun2 hOver2 f !x !(Pair x_h x_h2) = x_1
  where
    ! f1 = f (x, x_h)
    ! x_1' = x ..+.. 2 * hOver2 .*.. f1
    ! f2 = f (x_1', x_h2)
    ! x_1 = x ..+.. hOver2 .*.. (f1 ..+.. f2)


initialCond :: Int -> (Sample, Delay, Int)
initialCond nodesN = (initialSampleXY, initialInterval, samplesPerDelay)
  where cdi = 1.1247695e-4 :: F  -- A fixed point for birefrP
        initialInterval = U.replicate samplesPerDelay cdi
        samplesPerDelay = nodesN * stepsPerNode
        initialSampleXY = Sample 0.0 0.0

integrator
  :: PrimitiveM.PrimMonad m =>
    (Sample -> Pair -> Sample)
    -> Int
    -> Int
    -> (Sample, (Delay, Input))
    -> m (Sample, U.Vector F)
integrator iterate1 len total (xy0, (history0, input)) = do
    ! v <- UM.new total
    go v 0 xy0
    history <- U.unsafeFreeze v
    -- Zero y value, currently not used
    let xy = Sample (history `U.unsafeIndex` (total - 1)) 0.0
    return (xy, history)
  where
    h i = history0 `U.unsafeIndex` i
    go !v !i !xy
      -- The first iteration
      | i == 0 = do
        let !r = iterate1 xy (Pair (h 0) (h 1))
        UM.unsafeWrite v i (getX r)
        go v 1 r
      | i < len - 1 = do
        let !r = iterate1 xy (Pair (h i) (h $ i + 1))
        UM.unsafeWrite v i (getX r)
        go v (i + 1) r
      | i == total = do
        return ()
      -- Iterations after the initial history has been exhausted
      | otherwise = do
        ! newX0 <- if i == len - 1
                      then return (getX xy0)
                      else UM.unsafeRead v (i - len - 1)
        ! newX <- UM.unsafeRead v (i - len)
        let !r = iterate1 xy (Pair newX0 newX)
        UM.unsafeWrite v i (getX r)
        go v (i + 1) r

-- Not used in this version
zero :: Input
zero = U.fromList []

nodes :: Int
nodes = 306

main :: IO ()
main = do
  let delays = 4000
      (sample0, hist0, delayLength) = initialCond nodes
      -- Iterator implements Heun's schema
      iterator = heun2 (recip 2^(7::Int) :: F) (ikeda_model2 birefrP)
      totalComputedIterations = delayLength * delays

  -- Calculates all the time trace
  (xy1, history1) <- integrator iterator delayLength totalComputedIterations (sample0, (hist0, zero))
  putStrLn $ show $ getX xy1

  return ()

The nonlinear function (imported) can look like this:

data Parameters = Parameters { beta :: Float
                             , alpha :: Float
                             , phi :: Float } deriving Show
paramA :: Parameters
paramA = Parameters { beta = 1.1
                    , alpha = 1.0
                    , phi = 0.01 }

birefr :: Parameters -> Float -> Float
birefr par !x = 0.5 * beta' * (1 - alpha' * (cos $ 2.0 * (x + phi')))
  where
    ! beta' = beta par
    ! alpha' = alpha par
    ! phi' = phi par

birefrP :: Float -> Float
birefrP = birefr paramA
Community
  • 1
  • 1
penkovsky
  • 893
  • 11
  • 14
  • can you put the C counterpart somewhere? – Random Dev Nov 08 '15 at 17:02
  • Yes and no. The C code is compiled with mex so that it can be accessible in Matlab. Yes, I think it should not be a problem. No, it was written not by me, it is not very readable and I cannot put the matlab code using it. – penkovsky Nov 08 '15 at 17:06
  • So here is the C code http://pastebin.com/TgHbqaRs. It is different in that sense it uses external input and rk4 routine (but then it must be even slower). – penkovsky Nov 08 '15 at 17:16
  • 5
    Replacing `Sample` with `data Sample = Sample {-# UNPACK #-} !F {-# UNPACK #-} !F` makes it 2-3x faster for me. Code at http://lpaste.net/144896 – cchalmers Nov 08 '15 at 17:19
  • For me too. Thank you! – penkovsky Nov 08 '15 at 17:37
  • 3
    The general pattern for optimizing numeric code is to make numeric fields strict then compile with `-funbox-strict` (or add annotations as cchalmers suggests), make sure code can be inlined (possibly marking as INLINE), then dump core and make sure you're seeing all primitive unboxed functions and values (with a `#` in their name). If you're seeing some boxed types go back and add more `!` and moar INLINE, repeat – jberryman Nov 08 '15 at 22:28
  • Here is for example dump for the most "heavy" function ```integrator``` http://pastebin.com/YZAFKsPc. I am not sure about the optimization information it can provide. – penkovsky Nov 09 '15 at 00:16
  • 1
    Is the [FFI](https://wiki.haskell.org/Foreign_Function_Interface) not applicable in your case? – mucaho Nov 09 '15 at 01:01
  • FFI is not not applicable. But using FFI is almost the same as to say 'it is impossible to have fast code in Haskell'. Or to say 'it is much faster to write code in C than optimize Haskell code'. – penkovsky Nov 09 '15 at 09:14
  • @penkovsky There is no shame in doing performance-critical stuff in a language that is designed for performance-critical stuff. If you are writing a numerical integration library for haskell or just for fun / learning, then I understand your point. If you need numerical integration as a part of a bigger application in production, which does not provide that strict performance guarantees - so that you may use haskell as the uppermost abstraction layer, why would you bother not using FFI (since you already have C code)? Does that make sense? :) – mucaho Nov 09 '15 at 22:12
  • @mucaho That C code works only for 1D systems while I'd like to have both 1D and 2D and spend less time for debugging. – penkovsky Nov 10 '15 at 18:04
  • Very cool that you aimed at writing optimal numerical integration code purely in haskell. Did you actually manage to make your code equally fast to the C code (in the 1D case) and how much is the difference in percent? – exchange Aug 21 '22 at 14:35
  • Thanks to the comments above, I was able to achive a reasonable performance. Here are some benchmarks for Haskell code https://github.com/masterdezign/dde/blob/master/bench/Bench.hs. This should help to compare with similar implementations in other programming languages. – penkovsky Aug 22 '22 at 09:11

0 Answers0