Why is this simple Haskell program so slow?

Question

In this trivial program to print all the numbers from 1 to 10000000, a Haskell version and a C version, why is the Haskell one so slow and what commands help to learn how to improve the Haskell program's performance?

Below is a report containing all the details necessary to reproduce my exciting event, the sources are printed when making the report including the source of the Makefile:

$ make -B report
cat Foo.hs
import Data.Foldable
main = traverse_ print [1..10000000]
cat Fooc.c
#include <stdio.h>

int main()
{
    for (int n = 0; n < 10000000; ++n)
    {
        printf("%d\n", n+1);
    }
}
ghc -O3 Foo.hs -o Foo
time ./Foo | tail -n1
3.45user 0.03system 0:03.49elapsed 99%CPU (0avgtext+0avgdata 4092maxresident)k
0inputs+0outputs (0major+290minor)pagefaults 0swaps
10000000
cc -O3    Fooc.c   -o Fooc
time ./Fooc | tail -n1
0.63user 0.02system 0:00.66elapsed 99%CPU (0avgtext+0avgdata 1468maxresident)k
0inputs+0outputs (0major+63minor)pagefaults 0swaps
10000000
cat Makefile
.PHONY: printFoo printFooc printMakefile
printFoo: Foo.hs
    cat $^

printFooc: Fooc.c
    cat $^

printMakefile: Makefile
    cat $^

Fooc: CFLAGS=-O3
Fooc: Fooc.c
Foo: Foo.hs
    ghc -O3 $^ -o $@

.PHONY: timeFoo timeFooc
timeFoo: Foo
    time ./$^ | tail -n1
timeFooc: Fooc
    time ./$^ | tail -n1

.PHONY: report
report: printFoo printFooc timeFoo timeFooc printMakefile

Also note both of your timings include two different things: the computations that generate the values, and the I/O that prints them. Your results would be more informative if you kept them apart. For instance `print (last [1::Int .. 10000000])`) will print just the last element of the list (while using `Int` instead of `Integer`, as the other folks here suggest). — duplode, Jul 20 '19 at 13:20
People, if you're going to comment make sure you actually have a point! `Integer` vs constant-size types has little to do with the performance issue here. Nor has the unboxed list structure. The fact is, `print` appears to be indeed significantly slower than `printf` in C. But **a)** it's pretty silly to be concerned about the performance of _printing_, because if you have so many numbers that it would matter then [don't print them](https://what-if.xkcd.com/59/)! (Always use binary formats for large data amounts.) ... — leftaroundabout, Jul 20 '19 at 15:22
...So saying “Haskell is slow” because of this is extremely unfair, it's just that _a particular task where it shouldn't really matter is slow_. **b)** it's not _so darn slow_. A factor 5, many languages can only dream of getting that close to C – in tasks where performance really _would_ matter! Reasonably well-written Haskell is as a rule of thumb a factor 2 slower than C; sometimes even a simple version is close to C and with some effort it's almost always possible to achieve that. — leftaroundabout, Jul 20 '19 at 15:22
I should point out that both programs had the same economic cost to write and similar costs to understand so the "code you wrote is slow" comment isn't very informative. — codeshot, Jul 20 '19 at 15:34
@duplode a computation without I/O is pointless. We could replace it with a heating element and a single print of the final number (which I was told back in 2001 Haskell would do and it was a problem for the debian shootout because it ran all the examples at compile time - but now I come to learn it, it no longer has any natural decent performance at all). — codeshot, Jul 20 '19 at 15:38
If you read the question you'll know more, this question is easier to read than others like https://stackoverflow.com/questions/35027952/why-is-haskell-ghc-so-darn-fast — codeshot, Jul 20 '19 at 15:44
@ThomasM.DuBuisson, this question doesn't involve a strict function saving up a stack of operations. — codeshot, Jul 20 '19 at 15:47
@codeshot (1) My point is that your benchmark is, in principle, measuring two things rather than one. If you actually want to measure the performance of `print`ing ten million elements to the console, your results are meaningful; think of my comment as an attempt to clarify whether that is really the case. (2) There are ways to find out whether the optimisation you describe actually happens: you can check the Core (the intermediate code generated by the compiler), or simply check how the time spent varies with the length of the list. — duplode, Jul 20 '19 at 16:58
By separate the computation from the I/O do you mean I should do (1 <$ [1..10000000]) so no numbers are computed? It helps the speed a bit, but doesn't answer the question - I suppose that's why it's a comment and not an answer. — codeshot, Jul 20 '19 at 17:03
@codeshot If you just want to print ten million times you don't need a list: `replicate 10000000 (print 1)` will do. In any case, it won't matter, as you note, because the I/O dwarfs the iteration from `1` to `10000000`; furthermore, even not considering that, with your usage pattern the performance cost of using a list should be entirely negligible. (What I originally meant was that the I/O would be a problem if you were trying to benchmark the iteration, but by now it seems clear your question isn't about that.) — duplode, Jul 20 '19 at 18:05

Thomas M. DuBuisson · Answer 1 · 2019-07-20T23:53:26.367

On my system your Haskell code takes about 3.2 seconds.

N.B. your C code takes...

time ./fooc | tail -n1
ld: warning: directory not found for option '-L/opt/local/lib'
10000000
./fooc  0.92s user 0.03s system 33% cpu 2.863 total
tail -n1  2.85s user 0.01s system 99% cpu 2.865 total

So do note the difference of time a | b and what that means vs time (a | b).

Haskell is slow in part because (some of this is hypothesis)

By default print and the underlying putStrLn use String which is a linked list of characters.
UTF encoding
RTS differences

For 1, the packed variant using Text doesn't perform much differently, perhaps due to issue 2.

For 2, the ByteString variant (packed bytes instead of characters) is more representative of what your C program is doing:

-- Using functions from the Relude package
main = traverse_ putBSLn (show <$> [(1::Int)..10000000])

Results in

10000000
./foo  1.55s user 0.08s system 56% cpu 2.904 total

So the CPU time is closer to your C program, leading me to hypothesize that this difference is largely about the unnecessary UTF8 handling built into the routines you use by default in Haskell's prelude.

Dead-ends:

I tried NoBuffering and large BlockBuffering with no luck.
Constructing a large bytestring and printing with a single call wasn't any better (lazy or strict bytestrings).
Printing via Text instead of String gave only the barest improvement.
Rendering directly to ByteString instead of by packing values showed into Strings. This could be a win, I expect, if done well.

EDIT: I can't believe I forgot about Builder, which is an optimized way to build bytestrings and, in some cases, it fuses well to reduce allocations. Builder is the underpinning of the above example I already showed but using it directly allows some manual optimization.

{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString.Builder
import System.IO (stdout)
import Data.Foldable

main :: IO ()
main = do
    traverse_ (hPutBuilder stdout . (<>"\n") . intDec) [(1::Int)..10000000]

Performing at:

./foo  1.05s user 0.13s system 38% cpu 3.048 total
tail -n1  3.02s user 0.01s system 99% cpu 3.047 total

And indeed this is more efficient than the prior, many separate, calls to hPut because as hPutBuilder says:

This function is more efficient than hPut . toLazyByteString because in many cases no buffer allocation has to be done. Moreover, the results of several executions of short Builders are concatenated in the Handles buffer, therefore avoiding unnecessary buffer flushes.

So I should add: 4. Haskell was slow in this case because sometimes computations do not fuse and you end up with excess allocation, which isn't free.

Another thing that made a big difference in my testing: constructing the full string with newlines and all and then just calling `putStr(Ln)` once for the whole `String` instead of once per line. It seems there's a bunch of front-matter in `putStrLn` before it gets to actually dumping characters to the output buffer. It may be that `putBSLn` has a similar issue and you could get a second cheap win -- although unless you're using lazy bytestrings the memory cost may be something to consider! — Daniel Wagner, Jul 20 '19 at 17:47
Another thing to consider is the RTS setup time. Perhaps even `main = pure ()` takes some time only because the RTS needs to setup GC, etc. (did not test this, though). — chi, Jul 20 '19 at 20:02
main = pure () was reported at 0.00 elapsed. So basic RTS setup isn't an issue. — codeshot, Jul 20 '19 at 22:40
Thanks for this, but a key part of the question is missing: How does one figure out the cause of performance problems like this (and the remaining 2/5 of a second). One needs to understand what to improve next or it has to be decidable that no improvement is going to be practical. — codeshot, Jul 21 '19 at 09:08

Why is this simple Haskell program so slow?

1 Answers1