2

I have recently started using Data.Vector. My understanding is that it should be able to take chains of vector operations and efficiently combine them via fusion.

In particular, I would expect these two functions to have similar performance characteristics (i.e. searchTopDown to fuse into something like searchTd).

import qualified Data.Vector.Unboxed as VU
import Data.Vector.Unboxed (fromList, (!))

--| Going down from i searches the vector for the element satisfying predicate 
searchTopDown :: VU.Vector Int -> Int -> (Int -> Bool) -> (Int, Int)
searchTopDown vec i pred =
    VU.head $ VU.dropWhile (not . pred . snd)
    $ VU.reverse $ VU.take (1 + i)
    $ VU.indexed vec

searchTd :: VU.Vector Int -> Int -> (Int -> Bool) -> (Int, Int)
searchTd vec i pred = let val = vec ! i
                    in
                      if pred val then (i,  val) else searchTd vec (i-1) pred

However, benchmarking this code shows performance difference in orders of magnitude.

testTd/150
                 lower bound    estimate    upper bound
OLS regression      4.51 μs     4.53 μs     4.56 μs
R² goodness-of-fit  1.000       1.000       1.000
Mean execution time 4.52 μs     4.53 μs     4.55 μs
Standard deviation  35.2 ns     50.0 ns     69.8 ns

topDown/150
                 lower  bound    estimate    upper bound
OLS regression     75.4 μs       76.6 μs     77.9 μs
R² goodness-of-fit  0.997        0.998       0.999
Mean execution time 74.6 μs      75.2 μs     76.1 μs
Standard deviation  1.58 μs      2.42 μs     3.75 μs

Questions: What prevents searchTopDown from being fused? What can I do to help it fuse? When should I expect fusion in general?

Code used for benchmarks:

import qualified Data.List as L
import Criterion.Main

vec :: VU.Vector Int
vec = fromList $ L.replicate 100 (-1) L.++ L.replicate 100 1

testTd :: Int -> Int
testTd i = fst $ searchTd vec i (<0)

testTopDown :: Int -> Int
testTopDown i = fst $ searchTopDown vec i (<0)


benchMain = defaultMain [
     bgroup "testTd"    [ bench "1"  $ whnf testTd 1
                        , bench "90"  $ whnf testTd 90
                        , bench "100"  $ whnf testTd 100
                        , bench "150" $ whnf testTd 150
      ],
      bgroup "topDown"  [ bench "1"  $ whnf testTopDown 1
                        , bench "90"  $ whnf testTopDown 90
                        , bench "100"  $ whnf testTopDown 100
                        , bench "150" $ whnf testTopDown 150
      ]
  ]

EDIT Following @Zeta suggestions (adding -O2) I get excellent performance from both functions. However, I experience this problem in a larger project so I changed benchmarking code to resemble it. searchTopDown seems to be linear in i for this case. New benchmarking code and results (in microsecs):

import Data.Vector.Unboxed (fromList, (!))
import qualified Data.Vector.Unboxed as VU
import qualified Data.Vector.Unboxed.Mutable as VM
import Control.Monad.ST
import qualified Data.List as L

datStream :: [(Int, Int)]
datStream = L.zip [0..] $ L.replicate 100 (-1) L.++ L.replicate 100 1

vecUpdate v (iChg, vChg) = runST $ do
    mv <- VU.unsafeThaw v
    VM.modify mv (+ vChg) iChg
    VU.unsafeFreeze mv

vecs :: [VU.Vector Int]
vecs = let vecInit = VU.replicate 200 0 in L.drop 1 $ L.scanl' vecUpdate vecInit datStream

test f i = L.sum $ L.map (\x -> fst $ f x i (<0)) vecs
testTd = test searchTd
testTopDown = test searchTopDown

enter image description here

Artem Solod
  • 417
  • 2
  • 9
  • Possibly related: http://stackoverflow.com/q/42178164/1139697 – Zeta Feb 21 '17 at 14:58
  • 1
    Probably `VU.reverse`, since vector fusion is based on stream fusion and you can't reverse a stream. You could build a stream that traverses the original `vec` in reverse order though. – Reid Barton Feb 21 '17 at 15:02
  • @ReidBarton If everything is in a single module, `topDown` beats `testTd` in all benchmarks. I blame missing cross-module inlining, but I haven't looked at the core yet. – Zeta Feb 21 '17 at 15:04
  • But in real life you probably read `vec` from a file or something. So the lack of inlining gives you a more realistic benchmark. – Reid Barton Feb 21 '17 at 15:06
  • 3
    Even if I disable inlining, `topDown` beats `testTd` on 150 (1µs vs 750 ns), but loses on all others). I am only able to reproduce this behaviour if I don't specificy any optimizations at all. With `-O2`, `topDown` always stays below 1µs. @Artem, you did enable optimization, right? – Zeta Feb 21 '17 at 15:12
  • 2
    @Zeta, It looks like I did not. Compiling with `-O2` brings benchmarks to <100ns for `topDown/150` and ~400ns for `testTD/150`.These functions are used in a bigger project where I see performance difference similar to that in initial benchmarks irrespective of optimization flags used. I will try to dig a bit more and come up with a better minimal example. – Artem Solod Feb 21 '17 at 15:51
  • But even with optimization `testTd/90` is over an order of magnitude faster than `topDown/90`, right? – Reid Barton Feb 21 '17 at 17:57
  • @ReidBarton, `-O3` yields flat 30ns for `topDown/1/90/100`, 100ns for `topDown/150` and 19ns for `testTd/1/90`, 29ns for `testTd/100`, 420ns for `testTd/150`. In fact I'm quite impressed with what optimizer does to `topDown/150` – Artem Solod Feb 21 '17 at 18:58
  • 1
    There is no `-O3`. That option will do exactly the same thing as `-O2`. – dfeuer Feb 21 '17 at 19:16

0 Answers0