22

One of the programming problems I have come across involves calculating the factorials of large numbers (numbers up to 10^5). I have seen a simple Haskell solution that goes like this:

factorial :: (Eq x, Num x) => x -> x
factorial 0 = 1
factorial a = a * factorial (a - 1)

which implicitly handles the huge numbers and somehow runs faster, even though there is no caching involved in the code.

When I tried to solve the problem in Java, I had to use BigInteger to hold the huge numbers and also use an iterative version of factorial:

public static BigInteger factorialIterative(int n)
{
    if (n == 0 || n == 1) return BigInteger.valueOf(1);
    BigInteger f = BigInteger.valueOf(1);
    for (int i = 1; i <= n; i++)
        f = f.multiply(BigInteger.valueOf(i));
    return f;
}

The above code exceeded the time limit set for the program's execution. I also tried a cached recursive version of factorial:

public static BigInteger factorial(int n)
{
    if (cache[n] != null)
        return cache[n];
    else if (n == 0)
        return new BigInteger("1");
    else {
        cache[n] = BigInteger.valueOf(n).multiply(factorial(n - 1));
        return cache[n];
    }
}

which gave me an out of memory error (probably due to recursion).

My question is: why are functional programming languages like Haskell better at handling these sorts of problems involving huge numbers (despite there being no obvious caching)? Is there a way I can make the Java code run as fast as the Haskell code?

AndrewC
quirkystack
  • It would be good if you gave full (runnable) programs and mentioned exactly how you were compiling and running them. It's possible that the problem isn't where you expect. – shachaf Jul 11 '13 at 05:25
  • I was uploading the code to a programming challenge site [link](http://www.hackerrank.com) which provides compilers for various languages including Java and Haskell. I don't know which compiler they use though – quirkystack Jul 11 '13 at 05:38
  • 11
    OK, I benchmarked it myself and I'm also seeing the Java code be much slower than the Haskell code. My guess is that this is a problem with Java's `BigInteger` being much slower than GMP (the library GHC uses for `Integer`), and has almost nothing to do with the language itself. – shachaf Jul 11 '13 at 05:45

5 Answers

35

The difference is, as shachaf said, that GHC (by default) uses GMP for Integer computations that exceed the Int range, and GMP is rather well optimised. It has nothing to do with purity, caching, tail-call optimisation or such.

Java's BigInteger uses more or less the naive schoolbook algorithms. If you look at the code for multiply (OpenJDK 7), the workhorse is

/**
 * Multiplies int arrays x and y to the specified lengths and places
 * the result into z. There will be no leading zeros in the resultant array.
 */
private int[] multiplyToLen(int[] x, int xlen, int[] y, int ylen, int[] z) {
    int xstart = xlen - 1;
    int ystart = ylen - 1;

    if (z == null || z.length < (xlen+ ylen))
        z = new int[xlen+ylen];

    long carry = 0;
    for (int j=ystart, k=ystart+1+xstart; j>=0; j--, k--) {
        long product = (y[j] & LONG_MASK) *
                       (x[xstart] & LONG_MASK) + carry;
        z[k] = (int)product;
        carry = product >>> 32;
    }
    z[xstart] = (int)carry;

    for (int i = xstart-1; i >= 0; i--) {
        carry = 0;
        for (int j=ystart, k=ystart+1+i; j>=0; j--, k--) {
            long product = (y[j] & LONG_MASK) *
                           (x[i] & LONG_MASK) +
                           (z[k] & LONG_MASK) + carry;
            z[k] = (int)product;
            carry = product >>> 32;
        }
        z[i] = (int)carry;
    }
    return z;
}

a quadratic digit-by-digit multiplication (digits are of course not base 10). That doesn't hurt too much here, since one of the factors is always single-digit, but indicates that not too much work has yet been put into optimising BigInteger computations in Java.

One thing that can be seen from the source is that in Java, products of the form smallNumber * largeNumber are faster than largeNumber * smallNumber. In particular, if the small number is single-digit, having it as the first operand means the second loop (the one with the nested loop) doesn't run at all, so there is less loop-control overhead altogether, and the loop that does run has a simpler body.

So changing

f = f.multiply(BigInteger.valueOf(i));

in your Java version to

f = BigInteger.valueOf(i).multiply(f);

gives a considerable speedup (increasing with the argument, ~2× for 25000, ~2.5× for 50000, ~2.8× for 100000).
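
For reference, here is a minimal sketch of the full iterative method with only the operand order changed (same signature as in the question; an illustration, not a benchmark-grade version):

public static BigInteger factorialIterative(int n) {
    BigInteger f = BigInteger.ONE;
    // Counting from 2 with f = ONE also covers n = 0 and n = 1.
    for (int i = 2; i <= n; i++)
        f = BigInteger.valueOf(i).multiply(f); // single-digit factor first
    return f;
}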

The computation is still much slower than the GHC/GMP combination by a factor of roughly 4 in the tested range on my box, but, well, GMP's implementation is plain better optimised.

If you make computations that often multiply two large numbers, the algorithmic difference between BigInteger's quadratic multiplication and GMP's, which uses Karatsuba or Toom-Cook when the factors are large enough (and FFT for really large numbers), would show.

However, if multiplying is not all that you do, if you print out the factorials and hence convert them to a String, you get hit by the fact that BigInteger's toString method is abominably slow: it's roughly quadratic, so since the computation of the factorial is itself quadratic in the length of the result, you don't get a [much] higher algorithmic complexity, but you do get a big constant factor on top of the computation. The Show instance for Integer is much better, O(n * (log n)^x) [not sure what x is, between 1 and 2], so converting the result to a String adds only a little to the computation time.
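
A rough way to see this split on a given JVM is to time the multiplication loop and the toString call separately (a hypothetical harness; the class name and the choice of n = 50000 are arbitrary):

import java.math.BigInteger;

public class FactorialTiming {
    public static void main(String[] args) {
        int n = 50000; // arbitrary test size

        long t0 = System.nanoTime();
        BigInteger f = BigInteger.ONE;
        for (int i = 2; i <= n; i++)
            f = BigInteger.valueOf(i).multiply(f); // small factor first
        long t1 = System.nanoTime();

        String s = f.toString(); // decimal conversion, roughly quadratic
        long t2 = System.nanoTime();

        System.out.printf("multiply: %d ms, toString: %d ms, %d digits%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, s.length());
    }
}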

Daniel Fischer
  • Indeed. The sad thing is that this is all well known for years, and yet improvements like https://github.com/tbuktu/bigint didn't make it into Java7. – Ingo Jul 11 '13 at 10:09
  • 1
    Agreed. I can understand that Sun/Oracle have other priorities, but when somebody already did the work, there's no good reason to not review and incorporate it (unless it would cost real money). – Daniel Fischer Jul 11 '13 at 10:12
  • 1
    Honestly, I do not understand the priority thing given that they invested a good deal to fight the "Java is slow" meme. I am sure even Sun/Oracle people should know that better algorithms are low hanging fruits here. (Apart from the obvious that a mathematician specialised in numeric computations should have supervised the java.lang.BigInteger implementation from the very beginning. But even then, what would it cost to contract one for like 8 weeks to fix it once and for all.) – Ingo Jul 11 '13 at 10:23
  • 1
    Well, I would think the typical Java application rarely uses `BigInteger`, hence it wouldn't be a top priority. That's what I meant. Their apparent total lack of interest is indeed surprising. – Daniel Fischer Jul 11 '13 at 10:26
10

I first want to point out two factors which are clearly not the reason for the speed difference, but have been mentioned nevertheless, in the question and some answers.

No caching / memoization

The question mentions caching, and some of the answers mention memoization. But the factorial function does not benefit from memoization, because it recursively calls itself with different arguments. So we would never hit an entry in the cache that's already filled, and the whole caching is unnecessary. Maybe people were thinking of the fibonacci function here?
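
To make that concrete, here is a hypothetical Java sketch (not from the question): a naive fibonacci reaches the same argument from several call sites, so a cache is hit constantly, whereas factorial visits every argument exactly once.

import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;

public class MemoFib {
    private static final Map<Integer, BigInteger> cache = new HashMap<>();

    // fib(n) is needed by both fib(n + 1) and fib(n + 2), so cached values
    // are reused; factorial(n) is only ever needed by factorial(n + 1).
    static BigInteger fib(int n) {
        if (n <= 1) return BigInteger.valueOf(n);
        BigInteger cached = cache.get(n);
        if (cached != null) return cached;
        BigInteger result = fib(n - 1).add(fib(n - 2));
        cache.put(n, result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fib(90)); // 2880067194370816120
    }
}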

For the record, Haskell would not provide automatic memoization anyway.

No other clever optimization

Both the Java and the Haskell programs already look pretty optimal to me. Both programs use the iteration mechanism of choice of their respective languages: Java uses a loop, Haskell uses recursion. Both programs use a standard type for big-integer arithmetic.

If anything, the Haskell version should be slower because it is not tail recursive, whereas the Java version uses a loop which is the fastest looping construct available in Java.

I don't see much scope for clever high-level optimizations a compiler could make to these programs. I suspect that the observed speed difference is due to low-level details about how big integers are implemented.

So why is the Haskell version faster?

The Haskell compiler has built-in and reasonable support for Integer. That seems to be less so with Java implementations and the big integer class. I googled for "BigInteger slow" and the results suggest that the question really should be: Why is Java's BigInteger so slow? There seem to be other big integer classes that are faster. I'm not a Java expert, so I cannot answer this variant of the question in any detail.

Toxaris
  • You are right except for "object creation is slow". I am sure the Haskell runtime must do the same amount of "object creation" even if the underlying big integer library uses mutable integers. This is because, from the point of view of the Haskell code, the Integers need to appear pure and immutable. – Ingo Jul 11 '13 at 08:18
  • Maybe the Haskell runtime does the same amount of object creation, but I think that creating one object in the Haskell heap is faster than creating one object in the Java heap, for low-level reasons. I don't understand modern Java implementations well enough to really say something about it, so I try to phrase that part of the answer more carefully. – Toxaris Jul 11 '13 at 08:23
  • Any references for the claim that object creation "is known to be **much** faster" (emphasis mine) in Haskell vs. Java? I'd doubt that for objects that just set initial values for the components they are made of (and are, in this sense, comparable to Haskell "constructors"). Regarding the method tables, it's just a pointer to the constant pool. Yet, in the case of BigIntegers, it is true that construction implies construction of an int array and I bet the GMP library holds that array in-line. – Ingo Jul 11 '13 at 08:31
  • Concurrently weakened my claim to remove the *much*. I don't have any evidence about it. Do you think the answer would be better without any mention of "object creation"? – Toxaris Jul 11 '13 at 08:38
  • Indeed, I think that Java BigIntegers use less favorable (slower) algorithms and that this is the main reason for being slower (looking for references to back that claim meanwhile). Add to this a bit of overhead due to having them as Java objects. The fairest comparison, thus, would be to use a Java GMP implementation. – Ingo Jul 11 '13 at 08:43
  • I removed the argument about object creation altogether. Thanks for prompting me to look into the matter. Looks like I was guilty of remembering some outdated experiences for too long. – Toxaris Jul 11 '13 at 08:55
  • See, for example, https://github.com/tbuktu/bigint -- from this I take it that big performance improvements with just Java are possible. Conversely, that the standard algorithms are sub-optimal (especially regarding multiplication of big numbers, and we know that factorials get huge very quickly). – Ingo Jul 11 '13 at 09:09
  • @Ingo no, when you're using mutable objects, you only need two objects: the product, and the current multiplicand. You can then increment the multiplicand without creating a new multiplicand object. You can also multiply into the product without creating a new product object. Java's immutable `BigInteger`s require the creation of two new `BigInteger`s per multiplication. I'm confident that GHC can elide object creation for objects that disappear immediately and can never otherwise be used by the programmer elsewhere in the code. – Chai T. Rex Jun 27 '18 at 05:35
9

Here's a related question: https://softwareengineering.stackexchange.com/q/149167/26988

It seems that in this particular case, you're seeing the difference in optimizations of a pure vs impure function. In Haskell, all functions are pure unless they're doing IO (see link).

I think what's going on is that GHC is able to optimize the code better because of the guarantee of purity. Even though there isn't a tail call, it knows there aren't any side effects (because of purity guarantee), so it can do some optimizations the Java code can't (like automatic caching and whatnot as @andrew mentions in his answer).

A better solution in Haskell would be to use the built-in product function:

factorial n = product [1..n]

This is able to do tail-call optimization because it's just iteration. The same can be done in Java with a for loop as in your example, but it won't have the benefit of being functionally pure.

Edit:

I assumed tail-call elimination was happening, but it apparently is not. Here's the original answer for reference (it still has useful information about why Haskell may be faster than Java in some recursive contexts).

Functional programming languages like Haskell take advantage of tail call elimination.

In most programming languages, recursive calls maintain a call stack. Each recursive call allocates a new stack frame, which isn't cleaned up until the call returns. For example:

call fact()
    call fact()
        call fact()
        cleanup
    cleanup
cleanup

Functional languages, however, don't need to maintain a stack. In procedural languages, it's often difficult to tell if the return value will be used by the calling function, so it's hard to optimize. In FP, however, the return value only makes sense when the recursion is complete, so you can eliminate the call stack and end up with something like this:

call fact()
call fact()
call fact()
cleanup

The call fact() lines can all happen in the same stack frame because the return value isn't needed in intermediate calculations.

Now, to answer your question, you can solve this problem in a variety of ways, all of which aim to eliminate the call stack:

  • use a for loop instead of recursion (usually the best option)
  • return void and hope the compiler does tail call elimination
  • use a trampoline function (similar to the for-loop idea, but it looks more like recursion; a rough sketch follows this list)
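
For the third option, here is a rough trampoline sketch in Java (the Step interface and the helper names are made up for illustration): each step is either a finished value or a thunk producing the next step, so a plain loop drives the recursion instead of the call stack.

import java.math.BigInteger;
import java.util.function.Supplier;

public class TrampolineFactorial {

    // A step in the computation: either a finished value or a thunk that
    // produces the next step. This interface is illustrative, not a standard API.
    interface Step<T> {
        default boolean isDone() { return false; }
        default T value() { throw new IllegalStateException("not done"); }
        default Step<T> next() { throw new IllegalStateException("already done"); }
    }

    static <T> Step<T> done(T v) {
        return new Step<T>() {
            @Override public boolean isDone() { return true; }
            @Override public T value() { return v; }
        };
    }

    static <T> Step<T> more(Supplier<Step<T>> thunk) {
        return new Step<T>() {
            @Override public Step<T> next() { return thunk.get(); }
        };
    }

    // The loop that replaces the call stack: keep asking for the next step
    // until the computation says it is done.
    static <T> T run(Step<T> step) {
        while (!step.isDone()) step = step.next();
        return step.value();
    }

    // Accumulator-style factorial expressed as trampoline steps.
    static Step<BigInteger> fact(int n, BigInteger acc) {
        if (n <= 1) return done(acc);
        return more(() -> fact(n - 1, BigInteger.valueOf(n).multiply(acc)));
    }

    public static void main(String[] args) {
        System.out.println(run(fact(20, BigInteger.ONE))); // 2432902008176640000
    }
}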


Note:

It's not guaranteed that recursive calls will reuse the same stack frame, so some implementations may reallocate on each recursive call. This is often easier and still provides the same memory safety as reusing the stack frame.


beatgammit
  • 3
    His Haskell `factorial` function isn't even tail recursive; it must multiply the result of the recursive call by `a` before it returns. Unless laziness is fixing this, there must be another answer. – DiegoNolan Jul 11 '13 at 04:37
  • 1
    Hmm, it doesn't appear to be tail recursive. I still think the information in my answer is useful, however it isn't correct. – beatgammit Jul 11 '13 at 04:54
  • 1
    GHC does not do "automatic caching", at least not in the sense of memoizing the function (it does sharing, but that's not going on here). – shachaf Jul 11 '13 at 05:20
  • @shachaf - Hmm, didn't know that. But I'm pretty sure the purity guarantee is related. Do you by chance know what else it could be? Also, do you have a reference for GHC not memoizing the function? – beatgammit Jul 11 '13 at 05:23
  • 2
    I don't know of a canonical place to point to to show that GHC *doesn't* do an optimization. :-) Just look at one of the [search results](https://encrypted.google.com/search?q=ghc+memoization) on the topic. GHC *could* do it, but figuring out where to do it is "hard", and that sort of space-for-time trade-off is left for the programmer to make. – shachaf Jul 11 '13 at 05:52
  • 3
    (Also: I would say "tail call elimination" isn't really an accurate phrase here -- GHC's evaluation model is different enough that it doesn't make sense to talk about the same sort of stack frames in the first place.) – shachaf Jul 11 '13 at 05:54
  • 5
    As for the actual problem -- I'm pretty sure it's what I said above, i.e. GHC's GMP `Integer` is much faster than Java's `BigInteger`. Almost all the time here is going to be spent in integer multiplication, so that's going to be the big difference. I measured other simple operations in GHC-Integer vs. Java-BigInteger and GHC was much faster. – shachaf Jul 11 '13 at 05:56
  • @shachaf is right, it's just the difference between an optimised (GMP) and a rather naive implementation. Since Java's `BigInteger`s are immutable, purity doesn't play a role here. – Daniel Fischer Jul 11 '13 at 10:08
  • `factorial 1500000` gives me a "Stack space overflow". GHC stack is pretty predictable. Look at the core output, it doesn't memoize in core, it recurses as written. Maybe STG does some additional optimization steps, but at any rate the stack is overflown, you're just not using enough in your tests. – Christopher Done Jul 18 '13 at 12:10
  • @ChristopherDone You're completely correct. The highest upvoted answer is actually correct. I'd remove this answer, but I think it still has relevant information. – beatgammit Jul 18 '13 at 18:06
4

I think the difference has nothing to do with tail-call optimization, or with optimization at all. The reason I think that is that optimization can, at its best, only achieve something like your iterative Java version.

The real reason is, IMHO, that Java BigIntegers are slow compared with Haskell's.

To establish this, I propose 2 experiments:

  1. Use the same algorithms, but use long. (The results will be garbage for higher numbers, but the computations will be done nevertheless; a rough sketch follows this list.) Here, the Java version should be on par with Haskell.

  2. Use a faster big integer library in the Java version. The performance should improve accordingly. There are wrappers for GMP out there, as well as improvements to the Java big integers like here. The manifold performance improvements possible for multiplication of large numbers are telling.
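
A minimal sketch of the first experiment (the method name is made up; for n > 20 the value silently overflows, but the loop still performs the same number of multiplications):

// Experiment 1 sketch: same iterative algorithm, but with primitive long.
// The result is garbage for n > 20 (silent overflow); the point is only to
// see how fast the bare loop runs without BigInteger arithmetic.
public static long factorialLong(int n) {
    long f = 1L;
    for (int i = 1; i <= n; i++)
        f *= i;
    return f;
}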

Ingo
  • +1 for proposing experiments. But note that with long, "the computations" will not be done nevertheless. The loops will be run, but the amount of work we do in each loop iteration goes down a lot, so the performance characteristics might be quite different. – Toxaris Jul 11 '13 at 08:32
  • Another good experiment would be to compare "Long" and "long" in Java. – Toxaris Jul 11 '13 at 08:32
  • @Toxaris For "Long" one could make the task to compute the factorials of the numbers in a list, i.e. `map fac [1..1000000000]` or some such - the time should be dominated then by boxing/unboxing operations on both sides. – Ingo Jul 11 '13 at 08:39
1

The below explanations are obviously not enough. Here are some slides that explain the transformation a function goes through when its parameters are strict (like the above example) and no thunks are generated: http://www.slideshare.net/ilyasergey/static-analyses-and-code-optimizations-in-glasgow-haskell-compiler

The Haskell version will be doing just the calculation, storing only the previous result and applying the next multiplication to it, e.g. 6 x 4, whereas the Java version is doing caching (all the historic values), memory management, GC and the like.

It's doing strictness analysis and it's automatically caching the previous computation. See: http://neilmitchell.blogspot.com.au/2008/03/lazy-evaluation-strict-vs-speculative.html?m=1

More details are on the Haskell Wiki: "Optimising compilers like GHC try to reduce the cost of laziness using strictness analysis, which attempts to determine which function arguments are always evaluated by the function, and hence can be evaluated by the caller instead."

"Strictness analysis can spot the fact that the argument n is strict, and can be represented unboxed. The resulting function won't use any heap while it is running, as you'd expect."

"Strictness analysis is a process by which GHC attempts to determine, at compile-time, which data definitely will 'always be needed'. GHC can then build code to just calculate such data, rather than the normal (higher overhead) process for storing up the calculation and executing it later."

http://www.haskell.org/haskellwiki/Performance/Strictness http://www.haskell.org/haskellwiki/GHC_optimisations

andrew
  • 1
    Strictness analysis is a good optimization for Haskell code, but hardly relevant compared to similar code in Java, where everything is strict! The CPR/worker-wrapper optimization (i.e. unboxing) GHC would do for `Int` is also not relevant here, if the code is being called with `Integer`. And as I wrote above, GHC doesn't do "automatically caching", so that doesn't affect things here either. – shachaf Jul 11 '13 at 05:23
  • 1
    Yep, I know what strictness analysis does. I also know what kind of code the Haskell function gets compiled into -- even when inlined/strictness-analyzed/CPRed/whatever -- and I really doubt it's fair to attribute very much to it in that code (though of course it's hard to make any completely definitive statements without seeing the full program). – shachaf Jul 11 '13 at 05:31