How to improve performance of this numerical computation in Haskell?

Question

I'm in the middle of porting David Blei's original C implementation of Latent Dirichlet Allocation to Haskell, and I'm trying to decide whether to leave some of the low-level stuff in C. The following function is one example—it's an approximation of the second derivative of lgamma:

double trigamma(double x)
{
    double p;
    int i;

    x=x+6;
    p=1/(x*x);
    p=(((((0.075757575757576*p-0.033333333333333)*p+0.0238095238095238)
         *p-0.033333333333333)*p+0.166666666666667)*p+1)/x+0.5*p;
    for (i=0; i<6 ;i++)
    {
        x=x-1;
        p=1/(x*x)+p;
    }
    return(p);
}

I've translated this into more or less idiomatic Haskell as follows:

trigamma :: Double -> Double
trigamma x = snd $ last $ take 7 $ iterate next (x' - 1, p')
  where
    x' = x + 6
    p  = 1 / x' ^ 2
    p' = p / 2 + c / x'
    c  = foldr1 (\a b -> (a + b * p)) [1, 1/6, -1/30, 1/42, -1/30, 5/66]
    next (x, p) = (x - 1, 1 / x ^ 2 + p)

The problem is that when I run both through Criterion, my Haskell version is six or seven times slower (I'm compiling with -O2 on GHC 6.12.1). Some similar functions are even worse.

I know practically nothing about Haskell performance, and I'm not terribly interested in digging through Core or anything like that, since I can always just call the handful of math-intensive C functions through FFI.

But I'm curious about whether there's low-hanging fruit that I'm missing—some kind of extension or library or annotation that I could use to speed up this numeric stuff without making it too ugly.

UPDATE: Here are two better solutions, thanks to Don Stewart and Yitz. I've modified Yitz's answer slightly to use Data.Vector.

invSq x = 1 / (x * x)
computeP x = (((((5/66*p-1/30)*p+1/42)*p-1/30)*p+1/6)*p+1)/x+0.5*p
  where p = invSq x

trigamma_d :: Double -> Double
trigamma_d x = go 0 (x + 5) $ computeP $ x + 6
  where
    go :: Int -> Double -> Double -> Double
    go !i !x !p
        | i >= 6    = p
        | otherwise = go (i+1) (x-1) (1 / (x*x) + p)

trigamma_y :: Double -> Double
trigamma_y x = V.foldl' (+) (computeP $ x + 6) $ V.map invSq $ V.enumFromN x 6

The performance of the two seems to be almost exactly the same, with one or the other winning by a percentage point or two depending on the compiler flags.

As camccann said over at Reddit, the moral of the story is "For best results, use Don Stewart as your GHC backend code generator." Barring that solution, the safest bet seems to be just to translate the C control structures directly into Haskell, although loop fusion can give similar performance in a more idiomatic style.

I'll probably end up using the Data.Vector approach in my code.

The C program uses loops, while in Haskell you're using heap allocated lists. They won't have the same performance. The best thing to do is to directly translate the control and data structures into Haskell, to keep the same performance. — Don Stewart, Jun 05 '10 at 03:29
Hi Travis! Will you release your code when finished? I found that I could understand your Haskell based on the C code.. Maybe it would be possible for me to learn Haskell in this way.. — Yin Zhu, Jun 06 '10 at 17:19
@Yin Zhu: Yes, of course! I'll post a link here when I'm done. I hope it will be useful to you. — Travis Brown, Jun 06 '10 at 17:34
@DeadMG: Thanks for the suggestion, but I'm not sure if it would be helpful since we're not taking square roots here. Am I missing something? — Travis Brown, Jun 06 '10 at 17:38
Whoops. I thought it was inverse square root, not inverse square. — Puppy, Jun 06 '10 at 19:10
Can you show comparision of performance of C and Haskell code using Data.Vector? — Trismegistos, Jan 28 '12 at 12:43

score 50 · Accepted Answer · edited Sep 10 '14 at 12:00

Use the same control and data structures, yielding:

{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -fvia-C -optc-O3 -fexcess-precision -optc-march=native #-}

{-# INLINE trigamma #-}
trigamma :: Double -> Double
trigamma x = go 0 (x' - 1) p'
    where
        x' = x + 6
        p  = 1 / (x' * x')

        p' =(((((0.075757575757576*p-0.033333333333333)*p+0.0238095238095238)
                  *p-0.033333333333333)*p+0.166666666666667)*p+1)/x'+0.5*p

        go :: Int -> Double -> Double -> Double
        go !i !x !p
            | i >= 6    = p
            | otherwise = go (i+1) (x-1) (1 / (x*x) + p)

I don't have your testsuite, but this yields the following asm:

A_zdwgo_info:
        cmpq    $5, %r14
        jg      .L3
        movsd   .LC0(%rip), %xmm7
        movapd  %xmm5, %xmm8
        movapd  %xmm7, %xmm9
        mulsd   %xmm5, %xmm8
        leaq    1(%r14), %r14
        divsd   %xmm8, %xmm9
        subsd   %xmm7, %xmm5
        addsd   %xmm9, %xmm6
        jmp     A_zdwgo_info

Which looks ok. This is the kind of code the -fllvm backend does a good job.

GCC unrolls the loop though, and the only way to do that is either via Template Haskell or manual unrolling. You might consider that (a TH macro) if doing a lot of this.

Actually, the GHC LLVM backend does unroll the loop :-)

Finally, if you really like the original Haskell version, write it using stream fusion combinators, and GHC will convert it back into loops. (Exercise for the reader).

Thanks, Don—this is great. Your version actually somehow beats the C version (slightly) in my test setup. For the record, though, the first line should read `trigamma x = go 0 (x' - 1) p'` and the instances of `x` in the definition of `p` and `p'` should be replaced by `x'`. — Travis Brown, Jun 05 '10 at 04:02
Just out of interest, did you use the genetic algorithm to reach those compile options? — Robert Massaioli, Jun 06 '10 at 06:51

score 8 · Answer 2 · edited Jun 11 '20 at 11:56

Before the optimization work, I wouldn't say that your original translation is the most idiomatic way to express in Haskell what the C code is doing.

How would the optimization process have proceeded if we started with the following instead:

trigamma :: Double -> Double
trigamma x = foldl' (+) p' . map invSq . take 6 . iterate (+ 1) $ x
where
  invSq y = 1 / (y * y)
  x' = x + 6
  p  = invSq x'
  p' =(((((0.075757575757576*p-0.033333333333333)*p+0.0238095238095238)
              *p-0.033333333333333)*p+0.166666666666667)*p+1)/x'+0.5*p

How to improve performance of this numerical computation in Haskell?

2 Answers2

Linked