Comparing speed of Haskell and C for the computation of primes

Question

I initially wrote this (brute force and inefficient) method of calculating primes with the intent of making sure that there was no difference in speed between using "if-then-else" versus guards in Haskell (and there is no difference!). But then I decided to write a C program to compare and I got the following (Haskell slower by just over 25%) :

(Note I got the ideas of using rem instead of mod and also the O3 option in the compiler invocation from the following post : On improving Haskell's performance compared to C in fibonacci micro-benchmark)

Haskell : Forum.hs

divisibleRec :: Int -> Int -> Bool
divisibleRec i j 
  | j == 1         = False 
  | i `rem` j == 0 = True 
  | otherwise      = divisibleRec i (j-1)

divisible::Int -> Bool
divisible i = divisibleRec i (i-1)

r = [ x | x <- [2..200000], divisible x == False]

main :: IO()
main = print(length(r))

C : main.cpp

#include <stdio.h>

bool divisibleRec(int i, int j){
  if(j==1){ return false; }
  else if(i%j==0){ return true; }
  else{ return divisibleRec(i,j-1); }
}

bool divisible(int i){ return divisibleRec(i, i-1); }

int main(void){
  int i, count =0;
  for(i=2; i<200000; ++i){
    if(divisible(i)==false){
      count = count+1;
    }
  }
  printf("number of primes = %d\n",count);
  return 0;
}

The results I got were as follows :

Compilation times

time (ghc -O3 -o runProg Forum.hs)
real    0m0.355s
user    0m0.252s
sys 0m0.040s

time (gcc -O3 -o runProg main.cpp)
real    0m0.070s
user    0m0.036s
sys 0m0.008s

and the following running times :

Running times on Ubuntu 32 bit

Haskell
17984
real    0m54.498s
user    0m51.363s
sys 0m0.140s


C++
number of primes = 17984
real    0m41.739s
user    0m39.642s
sys 0m0.080s

I was quite impressed with the running times of Haskell. However my question is this : can I do anything to speed up the haskell program without :

Changing the underlying algorithm (it is clear that massive speedups can be gained by changing the algorithm; but I just want to understand what I can do on the language/compiler side to improve performance)
Invoking the llvm compiler (because I dont have this installed)

[EDIT : Memory usage]

After a comment by Alan I noticed that the C program uses a constant amount of memory where as the Haskell program slowly grows in memory size. At first I thought this had something to do with recursion, but gspr explains below why this is happening and provides a solution. Will Ness provides an alternative solution which (like gspr's solution) also ensures that the memory remains static.

[EDIT : Summary of bigger runs]

max number tested : 200,000:

(54.498s/41.739s) = Haskell 30.5% slower

max number tested : 400,000:

3m31.372s/2m45.076s = 211.37s/165s = Haskell 28.1% slower

max number tested : 800,000:

14m3.266s/11m6.024s = 843.27s/666.02s = Haskell 26.6% slower

[EDIT : Code for Alan]

This was the code that I had written earlier which does not have recursion and which I had tested on 200,000 :

#include <stdio.h>

bool divisibleRec(int i, int j){
  while(j>0){
    if(j==1){ return false; }
    else if(i%j==0){ return true; }
    else{ j -= 1;}
  }
}


bool divisible(int i){ return divisibleRec(i, i-1); }

int main(void){
  int i, count =0;
  for(i=2; i<8000000; ++i){
    if(divisible(i)==false){
      count = count+1;
    }
  }
  printf("number of primes = %d\n",count);
  return 0;
}

The results for the C code with and without recursion are as follows (for 800,000) :

With recursion : 11m6.024s

Without recursion : 11m5.328s

Note that the executable seems to take up 60kb (as seen in System monitor) irrespective of the maximum number, and therefore I suspect that the compiler is detecting this recursion.

Does using `r = filter (not . divisible) [2..200000]` instead of the list comprehension make any difference? — huon, Aug 19 '12 at 13:03
http://www.haskell.org/haskellwiki/Prime_numbers might be of interest and maybe there are some interesting things (like arrays). — Laar, Aug 19 '12 at 13:36
@dbaupp : no using "r = filter (not . divisible) [2..200000]" makes no difference. Thanks — artella, Aug 19 '12 at 14:35
I don't know how Haskell is implemented internally, but in C this type of super deep recursion is considered a bad practice and I'm surprised this doesn't stack overflow and crash. Since it doesn't crash, the optimizer may be recognizing the unnecessary recursion and removing it. If it is actually recursing that deep than you are likely cache bound rather than computation-bound. Converting the recursive call in the C version into a simple nested for loop would be more typical C. — Alan, Aug 19 '12 at 16:07
@Alan I know it is not good practice but I wanted to mimic the structure of the haskell code. I actually did a loop implementation too, and it made no difference for 200,000. Maybe I will try for the 800,000 case and report back. Thanks. — artella, Aug 19 '12 at 16:13
Regarding your point 2: With GHC 7.4.1 and LLVM 3.0 I get 2m 28.489s with LLVM versus 2m 31.573s without, which is probably not statistically significant. — gspr, Aug 19 '12 at 16:33
@Alan : (see above) yes it makes no difference for 800,000 & memory used is independent of max number, so compiler must be recognising recursion. — artella, Aug 19 '12 at 16:46
@Alan : interestingly I just noticed that the haskell program slowly grows in the amount of memory it takes — artella, Aug 19 '12 at 16:58
@artella: You mean the amount of memory it takes grows when you increase the 200000 number? That's not strange: When that number grows, so does the list `r`. Your code needs all of `r` at the very end, to compute its length. The C code, on the other hand, just increments a counter. You'll have to do something similar in Haskell too if you want constant memory usage (the code will still be very Haskelly). Try `main = print $ foo [2..200000]` with `foo (x:xs) = if divisible x then foo xs else 1 + foo xs` (you'll also need the base case `foo [] = 0`. — gspr, Aug 19 '12 at 18:28
I'm not seeing any memory growth in a heap profile of the program. — Carl, Aug 19 '12 at 18:56
@Carl, I was just using "System monitor" in ubuntu, and looking at the "Memory" column. However gspr explained above and in the post below how to get around this. When I tried his suggestion the memory usage did remain static in System monitor. Thanks — artella, Aug 19 '12 at 18:59
btw the reason for your C++ code running essentially in the same time with and without recursion is that gcc in all probability performed tail call optimization there - just as GHC most probably did as well - turning syntactical recursion into an iterative loop. — Will Ness, Aug 19 '12 at 19:57

score 5 · Answer 1 · answered Aug 19 '12 at 18:39

5

This isn't really answering your question, but rather what you asked in a comment regarding growing memory usage when the number 200000 grows.

When that number grows, so does the list r. Your code needs all of r at the very end, to compute its length. The C code, on the other hand, just increments a counter. You'll have to do something similar in Haskell too if you want constant memory usage. The code will still be very Haskelly, and in general it's a sensible proposition: you don't really need the list of numbers for which divisible is False, you just need to know how many there are.

You can try with

main :: IO ()
main = print $ foldl' (\s x -> if divisible x then s else s+1) 0 [2..200000]

(foldl' is a stricter foldl from Data.List that avoids thunks being built up).

answered Aug 19 '12 at 18:39

gspr

11,144
3
41
74

Ahh Thanks! I totally missed that. I was so focused on the recursion that I was thinking that the recursion was the cause of the memory increase! I tried your code, and the memory does indeed remain static. Cheers. – artella Aug 19 '12 at 18:55
1

The list `r` shouldn't grow at all. It should get consumed as it is being built, and consumed stuff get discarded right away. I tried replacing the OP's code with `r3 n = length [ () | x <- [2..n], divisible x == False]` and it made no difference. – Will Ness Aug 19 '12 at 19:01
@WillNess : for the program I originally wrote, the memory definitely does increase in ubuntu (and it also seems to increase for your suggestion). However for gspr's suggestion above, the memory usage remains static. Thanks – artella Aug 19 '12 at 19:07
@artella interesting. that would indicate `length` as a culprit. I didn't notice any memory increase in my testings, it was always reported as 1MB, running the exe (yes, on Windows) with `+RTS -s`. ***But what about the speed?*** (of this code by gspr I mean). – Will Ness Aug 19 '12 at 19:12
@WillNess The speed was identical. That is the code with gspr's suggestion ran just as fast as original code. – artella Aug 19 '12 at 19:23
1

@artella one last thing that's nagging me, the memory. As per Thomas M. DuBuisson's remark at end of his post, it is about having a ***named list*** under the name `r`. My suggestion was using a *function* `r3 n = length ...`. You remarked it wasn't constant in memory use, which still seems very strange to me . Perhaps you tested my line of code, but with the named list, still? Perhaps it was a misunderstanding? Could you please test it with a function? Giving things names usually causes them 2be retained in memory, but the computation described by `r3` ***should*** run in constant memory..? – Will Ness Aug 20 '12 at 06:52
1

@WillNess : My apologies, you are absolutely correct. This time I watched it for longer and this is what I noticed : initially it increases to 800KB and then it suddenly falls to 500KB. Then again it starts increasing until it reaches 744KB, remaining at this level thereafter. Thank you. – artella Aug 20 '12 at 08:49
@artella great news, thanks! :) anything that is consistently under 1MB is constant memory for sure... :) – Will Ness Aug 20 '12 at 09:24

Thomas M. DuBuisson · Answer 2 · 2012-08-19T18:54:16.847

2

Well bang patters give you a very small win (as does llvm, but you seem to have expected that):

{-# LANUGAGE BangPatterns #-}

divisibleRec !i !j | j == 1         = False

And on my x86-64 I get a very big win by switching to smaller representations, such as Word32:

divisibleRec :: Word32 -> Word32 -> Bool
...
divisible :: Word32 -> Bool

My timings:

$ time ./so             -- Int
2262

real    0m2.332s

$ time ./so              -- Word32
2262

real    0m1.424s

This is a closer match to your C program, which is only using int. It still doesn't match performance wise, I suspect we'd have to look at core to figure out why.

EDIT: and the memory use, as was already noted I see, is about the named list r. I just inlined r, made it output a 1 for each non-divisble value and took the sum:

main = print $ sum $ [ 1 | x <- [2..800000], not (divisible x) ]

edited Aug 19 '12 at 18:54

answered Aug 19 '12 at 18:38

Thomas M. DuBuisson

64,245
7
109
166

or `length [ () | x <- [2..n], divisible x == False]`, which made no difference for me. `length [()|x<-[2..n],and [rem x d>0|d<-[x-1,x-2..2]]]` was much slower, and using `all` more slower still. – Will Ness Aug 19 '12 at 19:10
Hi, what was the rest of the code for the "BangPatterns" divisibleRec? Thanks – artella Aug 19 '12 at 19:24
What about the code? I didn't change any of the code I didn't show (so I just added the bang patterns and modified the type signatures then adjusted main to fix the memory use. – Thomas M. DuBuisson Aug 19 '12 at 19:45
@ThomasM.DuBuisson : Sorry I wasnt sure what BangPatterns were. I tried it out and it worked, but it made no difference to the speed. Thanks – artella Aug 19 '12 at 20:17
The main difference was the different types - did that do anything for you? Also, what is your platform? – Thomas M. DuBuisson Aug 19 '12 at 21:56
@ThomasM.DuBuisson I tried switching to Word32 and it made no difference. I am using ubuntu 32 bit. Why would Word32 vs Int make a difference? I thought that both used 32 bit containers? Thanks – artella Aug 19 '12 at 22:42
On a 64 bit system `Int` is 64 bit - hence the difference on my system. – Thomas M. DuBuisson Aug 19 '12 at 23:49

Will Ness · Answer 3 · 2012-08-19T19:37:43.873

1

Another way to write down your algorithm is

main = print $ length [()|x<-[2..200000], and [rem x d>0|d<-[x-1,x-2..2]]]

Unfortunately, it runs slower. Using all ((>0).rem x) [x-1,x-2..2] as a test, it runs slower still. But maybe you'd test it on your setup nevertheless.

Replacing your code with explicit loop with bang patterns made no difference whatsoever:

{-# OPTIONS_GHC -XBangPatterns #-}
r4::Int->Int
r4 n = go 0 2 where
  go !c i | i>n = c
          | True = go (if not(divisible i) then (c+1) else c) (i+1)

divisibleRec::Int->Int->Bool
divisibleRec i !j | j == 1 = False 
                  | i `rem` j == 0 = True 
                  | otherwise = divisibleRec i (j-1)

edited Aug 19 '12 at 19:37

answered Aug 19 '12 at 15:33

Will Ness

70,110
9
98
181

Hi, yes unfortunately it is slower. It took 9m55.231s versus the 1 minute for the original code. However it is very concise, which is nice:) – artella Aug 19 '12 at 19:41
1

(bang patterns in `divisibleRec` totally unnecessary: the very first thing that code does is compare `j==1`, which forces `j` anyway...). – Will Ness Aug 19 '12 at 20:01
Cool thanks. I wasnt sure what it was and so tried it out, and yes it makes no difference in ubuntu also. Thanks – artella Aug 19 '12 at 20:18

score 0 · Answer 4 · answered Aug 19 '12 at 17:26

0

When I started programming in Haskell I was also impressed about its speed. You may be interested in reading point 5 "The speed of Haskell" of this article.

answered Aug 19 '12 at 17:26

Claudi

5,224
17
30

Yes speed wise it is good, but I am not impressed by the fact that the memory gradually increases. There must be a way to implement the algorithm and have the program use a constant amount of memory (like the C program)....but I havent found it yet. – artella Aug 19 '12 at 17:54
see comment by gspr. His post above explains why the memory was increasing and puts me at ease! – artella Aug 19 '12 at 18:56

Comparing speed of Haskell and C for the computation of primes

4 Answers4

Linked