Haskell: Why does Int perform worse than Word64, and why is my program far slower than C?

Question

I was reading an article of how slow Haskell is in playing with Collatz conjecture, which basically states if you keep multiplying by three and adding one to an odd number, or dividing an even one with two, you will eventually get one. For instance, 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1.

The program given in this article is to calculate the longest Collatz sequence in a given range. The C version is:

#include <stdio.h>

int main(int argc, char **argv) {
   int max_a0 = atoi(argv[1]);
   int longest = 0, max_len = 0;
   int a0, len;
   unsigned long a;

   for (a0 = 1; a0 <= max_a0; a0++) {
      a = a0;
      len = 0;

      while (a != 1) {
         len++;
         a = ((a%2==0)? a : 3*a+1)/2;
      }

      if (len > max_len) {
         max_len = len;
         longest = a0;
      }
   }
   printf("(%d, %d)\n", max_len, longest);
   return 0;
}

Compiling with Clang O2, it runs on my computer for 0.2s.

The Haskell version given in that article generates the whole sequence as a list explicitly, and then calculate the length of the intermediate list. It is 10x slower than the C version. However, since the author used LLVM as a backend, which I haven't installed, I couldn't reproduce this. Using GHC 7.8 and the default backend, it runs 10s on my Mac, which is 50x slower than the C version.

Then, I wrote a version using tail recursion and not generating an intermediate list:

collatzNext :: Int -> Int
collatzNext a
  | even a    = a `div` 2
  | otherwise = (3 * a + 1) `div` 2

collatzLen :: Int -> Int
collatzLen n = collatzIter n 0
  where
    collatzIter 1 len = len
    collatzIter n len = collatzIter (collatzNext n) (len + 1)

main = do
  print $ maximum $ [collatzLen x | x <- [1..1000000]]

Compiled with GHC 7.8 and O2, it runs for 2s, 10x slower than the C version.

Interestingly, when I changed Int in the type annotation to Word, it only spent 1s, 2x faster!

I have tried BangPatterns for explicit strict evaluation, but no significant performance gain could be noticed — I guess GHC's strict analysis is smart enough to handle such a simple scenario.

My questions are:

Why is the Word version so much faster compared with the Int one?
Why is this Haskell program so slow compared with that written in C?

Just a comment, it is not woth to compare any run-time below 1 second. Maybe the libraries loading takes time? Who knows. Try timing specific parts of the code and use larger problems or compute them multiple times. — Vladimir F Героям слава, Apr 26 '15 at 09:37
@VladimirF Since I have used -O2 optimization, I think even 0.1 second means a lot. @n.m. I lack experience in profiling Haskell programs. I googled a little and checked the GHC manual. The simple time profiling turns out that 73.3% time is spent in `collatzLen.collatzIter`, while 24.3% in `collatzNext`. I am sorry but that's the only thing I can get with my current understanding of Haskell profiling... — Minsheng Liu, Apr 26 '15 at 09:44
It may be a lot, but you must time specific parts of the code, not the OS overhead, to be sure. O2 does not mean too much by itself. — Vladimir F Героям слава, Apr 26 '15 at 09:45
@VladimirF OS overhead can be measured approximately: by reducing n from 1000000 to 100, the time used by this program is `./a 0.00s user 0.00s system 29% cpu 0.013 total`, when n = 1000000, the time is `./a 1.14s user 0.01s system 99% cpu 1.157 total`. I also checked RTS's GC report, time used in GC is less than 1%. — Minsheng Liu, Apr 26 '15 at 09:51
Good to know, I was just telling you to be careful and go more int the details. Just comparing the result of `time ./a.out` is dangerous below 1 second. — Vladimir F Героям слава, Apr 26 '15 at 09:54
You should note that division operation with powers of `2` is generally faster with unsigned int rather than signed int. And that may be the case why your `Word` instance is faster because `Word` represents unsigned integer. — Sibi, Apr 26 '15 at 10:02
The `div` function is slower for Int than Word. For speed you should use `quot` with Int. — augustss, Apr 26 '15 at 10:31
Changing `quot` into `div` and adding `-fllvm` seem to do the trick. — Overmind Jiang, Apr 26 '15 at 11:48
BTW, the `collatzNext` function is wrong here and in the linked blog post. It should be `3 * n + 1` rather than `(3 * n + 1) / 2` in the odd case. — András Kovács, Apr 26 '15 at 12:04

score 22 · Accepted Answer · edited May 23 '17 at 10:30

The performance of this program depends on several factors. If we get all of them right, the performance becomes the same as that of the C program. Going through these factors:

1. Using and comparing the right word sizes

The posted C code snippet is not exactly right; it uses 32-bit integers on all architectures, while the Haskell Int-s are 64 bit on a 64-bit machine. Before anything else, we should make sure to use the same word size in both programs.

Also, we should always use native-sized integral types in our Haskell code. So if we're on a 64-bit system, we should use 64-bit numbers, and steer clear of Int32-s and Word32-s, unless there is a specific need for them. This is because operations on non-native integers are mostly implemented as foreign calls rather than primops, so they're significantly slower.

2. Division in collatzNext

div is slower than quot for Int, because div handles negative numbers differently. If we use div and switch to Word, the program gets faster, because div is the same as quot for Word. quot with Int works just as fine. However, this is still not as fast as C. We can divide by two by shifting bits to the right. For some reason not even LLVM does this particular strength reduction in this example, so we're best off doing it by hand, replacing quot n 2 by shiftR n 1.

3. Checking evenness

The fastest way to check this is by checking the least significant bit. LLVM can optimize even to this, while the native codegen cannot. So, if we're on native codegen, even n could be replaced with n .&. 1 == 0, and this gives a nice performance boost.

However, I found something of a performance bug with GHC 7.10. Here we don't get an inlined even for Word, which wrecks performance (calling a function with a heap-allocated Word box in the hottest part of the code does this). So here we should use rem n 2 == 0 or n .&. 1 == 0 instead of even. The even for Int gets inlined fine though.

4. Fusing away lists in collatzLen

This is a crucial factor. The linked blog post is a bit out-of-date with regards to this. GHC 7.8 can't do fusion here, but 7.10 can. This means that with GHC 7.10 and LLVM we can conveniently get C-like performance without significantly modifying the original code.

collatzNext a = (if even a then a else 3*a+1) `quot` 2
collatzLen a0 = length $ takeWhile (/= 1) $ iterate collatzNext a0
maxColLen n   = maximum $ map collatzLen [1..n]

main = do
    [n] <- getArgs
    print $ maxColLen (read n :: Int)

With ghc-7.10.1 -O2 -fllvm and n = 10000000, the above program runs in 2.8 seconds, while the equivalent C program runs in 2.4 seconds. If I compile the same code without LLVM, then I instead get 12.4 second runtime. This slowdown is entirely because of the lack of optimization on even. If we use a .&. 1 == 0, then the slowdown disappears.

5. Fusing away lists when computing the maximum length

Not even GHC 7.10 can do this, so we have to resort to manual loop-writing.

collatzNext a = (if a .&. 1 == 0 then a else 3*a+1) `shiftR` 1
collatzLen    = length . takeWhile (/= 1) . iterate collatzNext

maxCol :: Int -> Int
maxCol = go 1 1 where
  go ml i n | i > n = ml
  go ml i n = go (max ml (collatzLen i)) (i + 1) n

main = do
  [n] <- getArgs
  print $ maxCol (read n :: Int)

Now, for ghc-7.10.1 -O2 -fllvm and n = 10000000, the above code runs in 2.1 seconds, while the C program runs in 2.4 seconds. If we want to achieve similar performance without LLVM and GHC 7.10, we just have to manually apply the important missing optimizations:

collatzLen :: Int -> Int
collatzLen = go 0 where
  go l 1 = l
  go l n | n .&. 1 == 0 = go (l + 1) (shiftR n 1)
         | otherwise    = go (l + 1) (shiftR (3 * n + 1) 1)

maxCol :: Int -> Int
maxCol = go 1 1 where
  go ml i n | i > n = ml
  go ml i n = go (max ml (collatzLen i)) (i + 1) n

main = do
  [n] <- getArgs
  print $ maxCol (read n :: Int)

Now, with ghc-7.8.4 -O2 and n = 10000000, our code runs in 2.6 seconds.

Thanks! Would you mind me translating this answer into Chinese and putting it on a Chinese website? I asked my question there too. Of course, your name and link to this answer will be explicitly put. I just hope to make this answer read by more people. — Minsheng Liu, Apr 26 '15 at 16:10
I don't mind. BTW, the posts belong to SO under Creative Commons Share Alike license, AFAIK, so you don't necessarily need my permission (but thanks for asking anyway). — András Kovács, Apr 26 '15 at 16:25
Is the performance bug with GHC 7.10 (no inline of `even` with Word) a known issue that would hopefully get fixed in GHC 7.10.2 ? — Pierre R, Apr 29 '15 at 19:29
`Data.Bits.lsb` would be a much cleaner way of checking the least significant bit. Can't check right now -- does that optimize just as well for `Word` as the `(.&.)` version? — Cactus, Jul 28 '23 at 00:11

Haskell: Why does Int perform worse than Word64, and why is my program far slower than C?

1 Answers1

Linked