18

This is a follow-up to my previous question on which I asked why stream fusion wasn't kicking in a certain program. Turns out the problem was that certain functions weren't inlined, and an INLINE flag improved the performance by about 17x (which showcases the importance of inlining!).

Now, notice that, on the original question, I hardcoded 64 calls of incAll at once. Now, suppose that, instead, I create an nTimes function, which calls a function repeatedly:

module Main where

import qualified Data.Vector.Unboxed as V

{-# INLINE incAll #-}
incAll :: V.Vector Int -> V.Vector Int
incAll = V.map (+ 1)

{-# INLINE nTimes #-}
nTimes :: Int -> (a -> a) -> a -> a
nTimes 0 f x = x
nTimes n f x = f (nTimes (n-1) f x)

main :: IO ()
main = do
  let size = 100000000 :: Int
  let array = V.replicate size 0 :: V.Vector Int
  print $ V.sum (nTimes 64 incAll array)

In this case, just adding an INLINE pragma to nTimes won't help, because AFAIK GHC doesn't inline recursive functions. Is there any trick to force GHC to expand nTimes at compile time and, thus, recover expected performance?

Community
  • 1
  • 1
MaiaVictor
  • 51,090
  • 44
  • 144
  • 286
  • 2
    You could use Template Haskell to introduce syntax to expand repeated application. – Joachim Breitner Feb 11 '17 at 20:37
  • 1
    @JoachimBreitner just finished doing that. Had to learn Template Haskell. Still testing my answer, but it seems a lot faster (similar to the other question). – Zeta Feb 11 '17 at 21:29

3 Answers3

27

No, but you can use better functions. I'm not talking about V.map (+64), which would make things certainly a lot faster, but about nTimes. We have three candidates that already do what nTimes does:

{-# INLINE nTimesFoldr #-}
nTimesFoldr :: Int -> (a -> a) -> a -> a    
nTimesFoldr n f x = foldr (.) id (replicate n f) $ x

{-# INLINE nTimesIterate #-}
nTimesIterate :: Int -> (a -> a) -> a -> a    
nTimesIterate n f x = iterate f x !! n

{-# INLINE nTimesTail #-}
nTimesTail :: Int -> (a -> a) -> a -> a    
nTimesTail n f = go n
  where
    {-# INLINE go #-}
    go n x | n <= 0 = x
    go n x          = go (n - 1) (f x)

All versions take around 8 seconds, compared to the 40 seconds your versions take. Joachim's version also takes 8 seconds, by the way. Note that the iterate version takes more memory on my system. While there is an unroll plugin for GHC, it hasn't been updated in the last five years (it uses custom ANNotations).

No unroll at all?

However, before we despair, how well does GHC actually try to inline everything? Let's use nTimesTail and nTimes 1:

module Main where
import qualified Data.Vector.Unboxed as V

{-# INLINE incAll #-}
incAll :: V.Vector Int -> V.Vector Int
incAll = V.map (+ 1)

{-# INLINE nTimes #-}
nTimes :: Int -> (a -> a) -> a -> a    
nTimes n f = go n
  where
    {-# INLINE go #-}
    go n x | n <= 0 = x
    go n x          = go (n - 1) (f x)

main :: IO ()
main = do
  let size = 100000000 :: Int
  let array = V.replicate size 0 :: V.Vector Int
  print $ V.sum (nTimes 1 incAll array)
$ stack ghc --package vector -- -O2 -ddump-simpl -dsuppress-all SO.hs
main2 =
  case (runSTRep main3) `cast` ...
  of _ { Vector ww1_s9vw ww2_s9vx ww3_s9vy ->
  case $wgo 1 ww1_s9vw ww2_s9vx ww3_s9vy
  of _ { (# ww5_s9w3, ww6_s9w4, ww7_s9w5 #) ->

We can stop right there. $wgo is the go defined above. Even with 1 GHC does not unroll the loop. It's disturbing since 1 is a constant.

Templates to the rescue

But alas, it's not all for naught. If C++ programmers are able to do the following for compile time constants, so should we, right?

template <int N>
struct Call{
    template <class F, class T>
    static T call(F f, T && t){
        return f(Call<N-1>::call(f,std::forward<T>(t)));
    }
};
template <>
struct Call<0>{
    template <class F, class T>
    static T call(F f, T && t){
        return t;
    }  
};

And sure enough, we can, with TemplateHaskell*:

-- Times.sh
{-# LANGUAGE TemplateHaskell #-}
module Times where

import Control.Monad (when)
import Language.Haskell.TH

nTimesTH :: Int -> Q Exp
nTimesTH n = do
  f <- newName "f"
  x <- newName "x"

  when (n <= 0) (reportWarning "nTimesTH: argument non-positive")

  let go k | k <= 0 = VarE x
      go k          = AppE (VarE f) (go (k - 1))
  return $ LamE [VarP f,VarP x] (go n)

What does nTimesTH do? It creates a new function where the first name f is going to be applied to the second name x for a total of n times. n now needs to be a compile-time constant, which suits us, since loop-unrolling is only possible with compile-time constants:

$(nTimesTH 0) = \f x -> x
$(nTimesTH 1) = \f x -> f x
$(nTimesTH 2) = \f x -> f (f x)
$(nTimesTH 3) = \f x -> f (f (f x))
...

Does it work? And is it fast? How fast compared to nTimes? Let's try another main for that:

-- SO.hs
{-# LANGUAGE TemplateHaskell #-}
module Main where
import Times
import qualified Data.Vector.Unboxed as V

{-# INLINE incAll #-}
incAll :: V.Vector Int -> V.Vector Int
incAll = V.map (+ 1)

{-# INLINE nTimes #-}
nTimes :: Int -> (a -> a) -> a -> a    
nTimes n f = go n
  where
    {-# INLINE go #-}
    go n x | n <= 0 = x
    go n x          = go (n - 1) (f x)

main :: IO ()
main = do
  let size = 100000000 :: Int
  let array = V.replicate size 0 :: V.Vector Int
  let vTH   = V.sum ($(nTimesTH 64) incAll array)
  let vNorm = V.sum (nTimes 64 incAll array)
  print $ vTH == vNorm
stack ghc --package vector -- -O2 SO.hs && SO.exe +RTS -t
True
<<ghc: 52000056768 bytes, 66 GCs, 400034700/800026736 avg/max bytes residency (2 samples), 1527M in use, 0.000 INIT (0.000 elapsed), 8.875 MUT (9.119 elapsed), 0.000 GC (0.094 elapsed) :ghc>>

It yields the correct result. How fast is it? Let's use another main again:

main :: IO ()
main = do
  let size = 100000000 :: Int
  let array = V.replicate size 0 :: V.Vector Int
  print $ V.sum ($(nTimesTH 64) incAll array)
     800,048,112 bytes allocated in the heap                                         
           4,352 bytes copied during GC                                              
          42,664 bytes maximum residency (1 sample(s))                               
          18,776 bytes maximum slop                                                  
             764 MB total memory in use (0 MB lost due to fragmentation)             

                                     Tot time (elapsed)  Avg pause  Max pause        
  Gen  0         1 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s        
  Gen  1         1 colls,     0 par    0.000s   0.049s     0.0488s    0.0488s        

  INIT    time    0.000s  (  0.000s elapsed)                                         
  MUT     time    0.172s  (  0.221s elapsed)                                         
  GC      time    0.000s  (  0.049s elapsed)                                         
  EXIT    time    0.000s  (  0.049s elapsed)                                         
  Total   time    0.188s  (  0.319s elapsed)                                         

  %GC     time       0.0%  (15.3% elapsed)                                           

  Alloc rate    4,654,825,378 bytes per MUT second                                   

  Productivity 100.0% of total user, 58.7% of total elapsed        

Well, compare that to the 8s. So for a TL;DR: if you have compile-time constants, and you want to create and/or modify your code based on that constants, consider Template Haskell.

* Please note that this is my first Template Haskell code I've ever written. Use with care. Don't use too large n, or you might end up with a messed up function.

András Kovács
  • 29,931
  • 3
  • 53
  • 99
Zeta
  • 103,620
  • 13
  • 194
  • 236
  • 2
    Note: the solution is [up for code review](https://codereview.stackexchange.com/questions/155144/execute-a-function-n-times-where-n-is-known-at-compile-time). – Zeta Feb 12 '17 at 10:22
  • Hey just coming back to let you know this is a brilliant answer in most aspects, thank you. – MaiaVictor Mar 05 '17 at 14:22
16

There is a little known trick which Andres has told me before where you can actually get GHC to inline recursive functions by using type classes.

The idea is that instead of writing a function normally where you perform structural recursion on a value. You define your function using type classes and perform structural recursion on a type argument. In this example, type-level natural numbers.

GHC will happily inline each recursive call and produce efficient code as each recursive call is at a different type.

I didn't benchmark this or look at the core but it is noticeably faster.

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE PolyKinds #-}
{-# LANGUAGE ScopedTypeVariables #-}
module Main where

import qualified Data.Vector.Unboxed as V

data Proxy a = Proxy

{-# INLINE incAll #-}
incAll :: V.Vector Int -> V.Vector Int
incAll = V.map (+ 1)

oldNTimes :: Int -> (a -> a) -> a -> a
oldNTimes 0 f x = x
oldNTimes n f x = f (oldNTimes (n-1) f x)

-- New definition

data N = Z | S N

class Unroll (n :: N) where
    nTimes :: Proxy n -> (a -> a) -> a -> a

instance Unroll Z where
    nTimes _ f x = x

instance Unroll n => Unroll (S n) where
    nTimes p f x =
        let Proxy :: Proxy (S n) = p
        in f (nTimes (Proxy :: Proxy n) f x)

main :: IO ()
main = do
  let size = 100000000 :: Int
  let array = V.replicate size 0 :: V.Vector Int
  print $ V.sum (nTimes (Proxy :: Proxy (S (S (S (S (S (S (S (S (S (S (S Z)))))))))))) incAll array)
  print $ V.sum (oldNTimes 11 incAll array)
  • Nice, although if you want to use `nTimes 64`, the term `Proxy :: Proxy (S(S(S(S…(S Z)…)` will be rather… interesting to write. I'd use this with type level arithmetic, though. Someting like `Proxy (Ten :*: Six :+: Four)`. – Zeta Feb 13 '17 at 15:14
  • I still can't get those typeclass programming shenanigans, anyone who does that is clearly a wizard to me. – MaiaVictor Mar 05 '17 at 14:26
4

No.

You could write

{-# INLINE nTimes #-}
nTimes :: Int -> (a -> a) -> a -> a
nTimes n f x = go n
  where go 0 = x
        go n = f (go (n-1))

and GHC would inline nTimes, and likely specialize the recursive go to your particular arguments incAll and array, but it would not unroll the loop.

Joachim Breitner
  • 25,395
  • 6
  • 78
  • 139