Project Euler 14: performance compared to C and memoization

Question

I'm currently working on project euler problem 14.

I solved it using a poorly coded program, without memoization, that took ~~386~~ 5 seconds to run (see edit).

Here it is:

step :: (Integer, Int) -> Integer -> (Integer, Int)
step (i, m) n   | nextValue > m         = (n, nextValue)
                | otherwise             = (i, m)
                where nextValue = syr n 1

syr :: Integer -> Int -> Int
syr 1 acc   = acc
syr x acc   | even x    = syr (x `div` 2) (acc + 1)
            | otherwise = syr (3 * x + 1) (acc + 1)

p14 = foldl step (0, 0) [500000..999999]

My question is about several comments in the thread related to this problem, where were mentionned execution times of <1 s for programs as follow (C code, credits to the project euler forum user ix for the code -- note: I didn't check that the execution time is in fact as mentionned):

#include <stdio.h>


int main(int argc, char **argv) {
    int longest = 0;
    int terms = 0;
    int i;
    unsigned long j;
    for (i = 1; i <= 1000000; i++) {
        j = i;
        int this_terms = 1;
        while (j != 1) {
            this_terms++;
            if (this_terms > terms) {
                terms = this_terms;
                longest = i;
            }
            if (j % 2 == 0) {
                j = j / 2;
            } else {
                j = 3 * j + 1;
            }
        }
    }
    printf("longest: %d (%d)\n", longest, terms);
    return 0;
}

To me, those programs are kind of the same, when talking about the algorithm.

So I wonder why there is such a big difference? Or is there any fondamental difference between our two algorithms that can justify a x6 factor in performance?

BTW, I'm currently trying to implement this algorithm with memoization, but am kind of lost as to me, it's way easier to implement in an imperative language (and I don't manipulate monads yet so I can't use this paradigm). So if you have any good tutorial that fits a beginner to learn memoization, I'll be glad (the ones I encountered were not detailed enough or out of my league).

Note: I came to declarative paradigm through Prolog and am still in the very early process of discovering Haskell, so I might miss important things.

Note2: any general advice about my code is welcome.

EDIT: thanks to delnan's help, I compiled the program and it now runs in 5 seconds, so I mainly look for hints on memoization now (even if ideas about the existing x6 gap are still welcome).

Well, replacing `ghci` with `ghc -O` should be a no-brainer and is bound to narrow the gap. Regardless of how effective it is, trying it is probably a good idea, if only to answer the notorious "what have you tried?" comment ;) Not to mention that it's easier and faster to try out than bringing memoization in. — , Mar 18 '12 at 18:35
@delnan: well my problem is that I'm still new to Haskell and still didn't look at how to compile properly a program. Or to be more precise, I looked and saw that `do` was used and thought I'd go back at it later, when I'll be working on monads. But I'd have made the same comment reading this question, so I feel a bit stupid :p — m09, Mar 18 '12 at 18:38
To get you started, you can probably get away with `main = print p14` and compiling as `ghc -O p14.hs`. — , Mar 18 '12 at 18:42
@delnan: Thanks, it now runs in about 5 seconds, so that was definitely the main difference. I suggest you post your comment as an answer so that I can at least upvote it! I'll edit to precise that I'm now looking for answers mainly about memoization. — m09, Mar 18 '12 at 18:46
Well, 5x is still a pretty large difference, and I've got nothing to offer regarding memoization. I suggest you simply update the question with the new runtime, remove the GHCi reference, and carry on asking about memoization and other optimizations. I'd much rather see answers on that some rep :) — , Mar 18 '12 at 18:47
it is usually advised to use `-O2` switch when compiling. You can also add `{-# OPTIONS_GHC -O2 #-}` as top line in your source code and just compile with `ghc --make youfile`, then run as `>yourfile +RTS -s` to see the stats. — Will Ness, Mar 18 '12 at 18:54
For a version with memoization, I can only redirect you to my solution on the [Haskell wiki](http://www.haskell.org/haskellwiki/Euler_problems/11_to_20#Problem_14), the one under "Even faster solution...". It shows one of the nice way to do memoization in Haskell by exploiting lazy evaluation, much nicer than using State or other monadic code (except if you're using one of the library that do memoization, there's some nice work there). Note that if I remember correctly that is not that much faster than the non-memoizing solution. — Jedai, Mar 19 '12 at 00:40
@Jedai: yup I actually already read your solution. I still have some training to go through to be comfortable with it though. Thanks for mentionning it here :) — m09, Mar 19 '12 at 00:47
This is relevant, though I'm not sure if it's helpful. http://www.haskell.org/haskellwiki/Memoization#Memoising_CAFS. `wonderous` is `syr`, except without the accumulator (which would prevent memoization). — configurator, Mar 20 '12 at 20:10

Daniel Fischer · Accepted Answer · 2012-12-13T03:52:36.877

After having compiled it with optimisations, there are still several differences to the C programme

you use div, while the C programme uses machine division (which truncates) [but any self-respecting C compiler transforms that into a shift, so that makes it yet faster], that would be quot in Haskell; that reduced the run time by some 15% here.
the C programme uses fixed-width 64-bit (or even 32-bit, but then it's just luck that it gets the correct answer, since some intermediate values exceed 32-bit range) integers, the Haskell programme uses arbitrary precision Integers. If you have 64-bit Ints in your GHC (64-bit OS other than Windows), replace Integer with Int. That reduced the run time by a factor of about 3 here. If you're on a 32-bit system, you're out of luck, GHC doesn't use native 64-bit instructions there, these operations are implemented as C calls, that's still rather slow.

For the memoisation, you can outsource it to one of the memoisation packages on hackage, the only one that I remember is data-memocombinators, but there are others. Or you can do it yourself, for example keeping a map of previously calculated values - that would work best in the State monad,

import Control.Monad.State.Strict
import qualified Data.Map as Map
import Data.Map (Map, singleton)

type Memo = Map Integer Int

syr :: Integer -> State Memo Int
syr n = do
    mb <- gets (Map.lookup n)
    case mb of
      Just l -> return l
      Nothing -> do
          let m = if even n then n `quot` 2 else 3*n+1
          l <- syr m
          let l' = l+1
          modify (Map.insert n l')
          return l'

solve :: Integer -> Int -> Integer -> State Memo (Integer,Int)
solve maxi len start
    | len > 1000000 = return (maxi,len)
    | otherwise = do
         l <- syr start
         if len < l
             then solve start l (start+1)
             else solve maxi len (start+1)

p14 :: (Integer,Int)
p14 = evalState (solve 0 0 500000) (singleton 1 1)

but that will probably not gain too much (not even when you've added the necessary strictness). The trouble is that a lookup in a Map is not too cheap and an insertion is relatively expensive.

Another method is to keep a mutable array for the lookup. The code becomes more complicated, since you have to choose a reasonable upper bound for the values to cache (should be not much larger than the bound for the starting values) and deal with the parts of the sequences falling outside the memoised range. But an array lookup and write are fast. If you have 64-bit Ints, the below code runs pretty fast, here it takes 0.03s for a limit of 1 million, and 0.33s for a limit of 10 million, the corresponding (as closely as I reasonably could) C code runs in 0.018 resp. 0.2s.

module Main (main) where

import System.Environment (getArgs)
import Data.Array.ST
import Data.Array.Base
import Control.Monad.ST
import Data.Bits
import Data.Int

main :: IO ()
main = do
    args <- getArgs
    let bd = case args of
               a:_ -> read a
               _   -> 100000
    print $ collMax bd

next :: Int -> Int
next n
    | n .&. 1 == 0  = n `unsafeShiftR` 1
    | otherwise     = 3*n + 1

collMax :: Int -> (Int,Int16)
collMax upper = runST $ do
    arr <- newArray (0,upper) 0 :: ST s (STUArray s Int Int16)
    let go l m
            | upper < m = go (l+1) $ next m
            | otherwise = do
                l' <- unsafeRead arr m
                case l' of
                  0 -> do
                      l'' <- go 1 $ next m
                      unsafeWrite arr m (l'' + 1)
                      return (l+l'')
                  _ -> return (l+l'-1)
        collect mi ml i
            | upper < i = return (mi, ml)
            | otherwise = do
                l <- go 1 i
                if l > ml
                  then collect i l (i+1)
                  else collect mi ml (i+1)
    unsafeWrite arr 1 1
    collect 1 1 2

`unsigned long` isn't necessarily 64-bit. I think `Word` should always be the same as `unsigned long` with GHC. — ehird, Mar 18 '12 at 19:00
@Daniel Fischer: sadly I'm on a windows machine right now, so the `Int` type is not sufficient. The `quot` modification doesn't help here, execution time is the same. Thanks for your answer though, you were definitely right about `Integer`s. — m09, Mar 18 '12 at 19:16
@ehird Correct, `unsigned long` isn't necessarily 64 bits. But if the programme worked, it is on the author's machine. Ah, dang, it need not be, but that's just luck if it's 32 bits. — Daniel Fischer, Mar 18 '12 at 19:18
@DanielFischer: thanks for the expansion on your answer, the part added is indeed a very instesting read. — m09, Dec 19 '12 at 09:03

score 4 · Answer 2 · edited May 23 '17 at 11:59

4

Well, the C program uses unsigned long, but Integer can store arbitrarily large integers (it's a bignum). If you import Data.Word, then you can use Word, which is a machine-word-sized unsigned integer.

After replacing Integer with Word, and using ghc -O2 and gcc -O3, the C program runs in 0.72 seconds, while the Haskell programs runs in 1.92 seconds. 2.6x isn't bad. However, ghc -O2 doesn't always help, and this is one of the programs on which it doesn't! Using just -O, as you did, brings the runtime down to 1.90 seconds.

I tried replacing div with quot (which uses the same type of division as C; they only differ on negative inputs), but strangely it actually made the Haskell program run slightly slower for me.

You should be able to speed up the syr function with the help of this previous Stack Overflow question I answered about the same Project Euler problem.

edited May 23 '17 at 11:59

Community

1
1

answered Mar 18 '12 at 18:58

ehird

40,602
3
180
182

1

Well, even without modifications suggested in your other answer, your `Word` modification did the job: down to 1s now. It was definitely an `Integer` overhead problem. Thanks! – m09 Mar 18 '12 at 19:14
1

@Mog Note that with 32-bit `Word`s, you have overflow, and it's sheer luck that you get the correct answer. – Daniel Fischer Mar 18 '12 at 19:22
@Mog: If you're on a 32-bit platform, try using `Word64`; it may have less overhead than `Integer`. – ehird Mar 18 '12 at 19:27
@DanielFischer: A 32-bit Word doesn't overflow, at least if my test with `Word32` is relevant. It brings up the solution in about 0.55s while the same program with `Word64` brings it up in 5+ seconds, just as `Integer` actually. It's kind of tricky that the bit used for sign in `Int` is precisely the one needed not to overflow ;( – m09 Mar 18 '12 at 19:31
3

@Mog The largest number that occurs in one of the sequences is 56991483520, from the starting value 704511. It's just luck that none of the overflows leads to a longer sequence or that the one with the longest sequence doesn't overflow. – Daniel Fischer Mar 18 '12 at 19:52
@Mog Install a 64-bit Linux in a VM. Or wait a bit, work on a 64-bit GHC on Windows is being done. – Daniel Fischer Mar 18 '12 at 19:55
2

@Daniel Fischer: well I don't plan on using windows for too long as it tends to provoke brain damage on me if I stay exposed. I'll get a fedora running soon enough and won't be bothered anymore by those pesky 32 bit `Int`s : d – m09 Mar 18 '12 at 19:58

Arlen · Answer 3 · 2012-04-04T22:03:59.143

On my current system (32-bit Core2Duo) your Haskell code, including all the optimizations given in the answers, takes 0.8s to compile and 1.2s to run.

You could transfer the run-time to compile-time, and reduce the run-time to nearly zero.

module Euler14 where

import Data.Word
import Language.Haskell.TH

terms :: Word -> Word
terms n = countTerms n 0
  where
    countTerms 1 acc             = acc + 1
    countTerms n acc | even n    = countTerms (n `div` 2) (acc + 1)
                     | otherwise = countTerms (3 * n + 1) (acc + 1)

longestT :: Word -> Word -> (Word, Word) 
longestT mi mx = find mi mx (0, 0)
  where
      find mi mx (ct,cn) | mi == mx  = if ct > terms mi then (ct,cn) else (terms mi, mi)
                         | otherwise = find (mi + 1) mx
                                       (if ct > terms mi then (ct,cn) else (terms mi, mi))

longest :: Word -> Word -> ExpQ
longest mi mx = return $ TupE [LitE (IntegerL (fromIntegral a)),
                               LitE (IntegerL (fromIntegral b))]
  where
    (a,b) = longestT mi mx

and

{-# LANGUAGE TemplateHaskell #-}
import Euler14

main = print $(longest 500000 999999)

On my system it takes 2.3s to compile this but the run-time goes down to 0.003s. Compile Time Function Execution (CTFE) is something you can't do in C/C++. The only other programming language that I know of that supports CTFE is the D programming language. And just to be complete, the C code takes 0.1s to compile and 0.7s to run.

In many cases, you can avoid having to write wrappers like `longest` yourself and instead use `lift` from `Language.Haskell.TH.Syntax` to construct a literal expression from the result of your calculation. (Annoyingly though, there is no `Lift` instance for `Word`, so some tweaks are needed to make that work in this case). — hammar, Apr 03 '12 at 21:31

Project Euler 14: performance compared to C and memoization

3 Answers3

Linked