
I've got this Haskell file, compiled with `ghc -O2` (GHC 7.4.1), which takes 1.65 sec on my machine:

import Data.Bits
main = do
    print $ length $ filter (\i -> i .&. (shift 1 (i `mod` 4)) /= 0) [0..123456789]

The same algorithm in C, compiled with gcc -O2 (gcc 4.6.3), runs in 0.18 sec.

#include <stdio.h>
void main() {
    int count = 0;
    const int max = 123456789;
    int i;
    for (i = 0; i < max; ++i)
        if ((i & (1 << i % 4)) != 0)
            ++count;
    printf("count: %d\n", count);
}

Update: I thought it might be the Data.Bits stuff being slow, but surprisingly, if I remove the shifting and just do a straight mod, it actually runs slower, at 5.6 seconds!?!

import Data.Bits
main = do
    print $ length $ filter (\i -> (i `mod` 4) /= 0) [0..123456789]

whereas the equivalent C runs slightly faster at 0.16 sec:

#include <stdio.h>
void main() {
    int count = 0;
    const int max = 123456789;
    int i;
    for (i = 0; i < max; ++i)
        if ((i % 4) != 0)
            ++count;
    printf("count: %d\n", count);
}
lobsterism
  • Haskell equivalent of `%` is `rem`, not `mod`. The latter has extra checks for negative numbers, see http://stackoverflow.com/questions/339719 – Vitus Oct 04 '12 at 15:33
  • This is trivial enough that you could check the difference in the generated assembly :) ? I don't have a Haskell compiler on this machine, so I can't check it for you.. Seems Vitus is onto something. – Morten Jensen Oct 04 '12 at 15:34
  • changing to `rem` made about a 10% difference; 1.45s and 4.9s now for the respective problems. – lobsterism Oct 04 '12 at 15:37
  • Does it help if you replace `[0..123456789]` with `([0..123456789] :: [Int])`? Just wondering whether you get some unfortunate `Num` instance deduced (such as `Integer`). – Frerich Raabe Oct 04 '12 at 15:40
  • @MortenJensen the C assembly is 45 lines and the Haskell is 2086! – lobsterism Oct 04 '12 at 15:42
  • In your second version, the list returned by `filter` contains three times as many elements as in the first. I guess that's why it takes three times as long to run - the slow part is probably building the list. – interjay Oct 04 '12 at 15:43
  • @FrerichRaabe That sped up the 2nd calculation nearly 4*, down to 1.17s, but didn't seem to affect the first calculation. They're still way slower than C though. – lobsterism Oct 04 '12 at 15:51
  • FWIW, compiling with `-O2 -fllvm -funfolding-use-threshold1000` makes it go fast (about the same as C) for me. (Edit: actually not, only "faster" -- I made some other changes in between that make it go as fast as C.) – kosmikus Oct 04 '12 at 15:51
  • This code looks like it can be fused into a loop, but when I look at the core output, the call of `filter` generates a list. That is probably why ghc is slower. – Heatsink Oct 04 '12 at 16:03
  • @Heatsink: This won't fuse because [`length` is not a "good consumer"](http://hackage.haskell.org/trac/ghc/ticket/876). – hammar Oct 04 '12 at 16:47

4 Answers


The two pieces of code do very different things.

import Data.Bits
main = do
    print $ length $ filter (\i -> i .&. (shift 1 (i `mod` 4)) /= 0) [0..123456789]

creates a (lazy) list of 123456790 `Integer`s. For each element, it

  • takes the remainder modulo 4 - which involves first a check whether the `Integer` is small enough to wrap a raw machine integer, then, after the division, a sign check, since `mod` returns non-negative results only (though in ghc-7.6.1 there is a primop for that, so using `mod` is not as much of a brake as it was before);
  • shifts the `Integer` 1 left by the appropriate number of bits, which involves a conversion to "big" `Integer`s and a call to GMP;
  • takes the bitwise and with `i` - yet another call to GMP;
  • and checks whether the result is 0, which causes another call to GMP or a conversion to a small integer (I'm not sure what GHC does here).

Then, if the result is nonzero, a new list cell is created, that `Integer` is put in it, and the cell is consumed by `length`. That's a lot of work, most of it unnecessarily complicated due to the defaulting of unspecified number types to `Integer`.

The C code

#include <stdio.h>
int main(void) {
    int count = 0;
    const int max = 123456789;
    int i;
    for (i = 0; i < max; ++i)
        if ((i & (1 << i % 4)) != 0)
            ++count;
    printf("count: %d\n", count);
    return 0;
}

(I took the liberty of fixing the return type of main), does much, much less. It takes an int, compares it to another; if smaller, it takes the bitwise and of the first int with 3 (1), shifts the int 1 to the left by that number of bits, takes the bitwise and of that and the first int, and, if nonzero, increments another int, then increments the first. These are all machine ops, working on raw machine types.

If we translate that code to Haskell,

module Main (main) where

import Data.Bits

maxNum :: Int
maxNum = 123456789

loop :: Int -> Int -> Int
loop acc i
    | i < maxNum = loop (if i .&. (1 `shiftL` (i .&. 3)) /= 0 then acc + 1 else acc) (i+1)
    | otherwise  = acc

main :: IO ()
main = print $ loop 0 0

we get a much closer result:

C, gcc -O3:
count: 30864196

real    0m0.180s
user    0m0.178s
sys     0m0.001s

Haskell, ghc -O2:
30864196

real    0m0.247s
user    0m0.243s
sys     0m0.003s

Haskell, ghc -O2 -fllvm:
30864196

real    0m0.144s
user    0m0.140s
sys     0m0.003s

GHC's native code generator isn't a particularly good loop optimiser, so using the llvm backend makes a big difference here, but even the native code generator doesn't do too badly.

Okay, I have done the optimisation of replacing a modulus calculation with a power-of-two modulus by a bitwise and by hand; GHC's native code generator doesn't do that (yet). With ``i `rem` 4`` instead of `i .&. 3`, the native code generator produces code that takes (here) 1.42 seconds to run, but the llvm backend does that optimisation itself and produces the same code as with the hand-made optimisation.

Now, let us turn to gspr's question

While LLVM didn't have a massive effect on the original code, it really did on the modified (I'd love to learn why...).

Well, the original code used `Integer`s and lists; llvm doesn't know well what to do with those, so it can't transform that code into loops. The modified code uses `Int`s and the vector package, and vector rewrites the code to loops, so llvm does know how to optimise it well - and that shows.

(1) Assuming a normal binary computer. That optimisation is done by ordinary C compilers even without any optimisation flag, except on the very rare platforms where a div instruction is faster than a shift.

Daniel Fischer
  • Thanks for having a go at my question (and writing a great and lengthy answer in general!), but I must admit I lied: when I said "original code" I actually meant "original with `Int`". – gspr Oct 04 '12 at 18:25
  • Do you mean "the original code with some type signature(s) that force everything to be `Int`"? Then the point about the lists still applies, llvm wasn't made for that, so it's not particularly good at transforming list code into loops. The `vector` package is, though, if you do your "listish" code right. – Daniel Fischer Oct 04 '12 at 18:29
  • Oops, I read too fast, and only noticed the part about `Integer` vs `Int`. Thanks! – gspr Oct 04 '12 at 18:35
  • Awesome! So now...how to make the C code perform as fast as the Haskell? :) – lobsterism Oct 04 '12 at 18:51
  • That's easy (in this case), @lobsterism, give it the same treatment as the Haskell, `clang -O3 original.c && time ./a.out` ;) [In other words, llvm is better at optimising loops than gcc. (For the values of llvm and gcc installed on my system)] – Daniel Fischer Oct 04 '12 at 18:55
  • So the lessons learned are--hand-write your own recursive functions rather than using function composition, specify Int, and use LLVM when the code is performance critical? Anything else? – lobsterism Oct 04 '12 at 20:29
  • Hand-writing the loops is not always necessary; you can often use a fusion framework like `vector`'s to achieve the same. Which is easier and/or gives better performance depends on what you're used to (I rarely use `Vector`s and am used to writing loops, so I usually find those easier; if you're used to `Vector`s, you'll find those easier) and on the structure of the computation (you can always fiddle with hand-written loops to try to squeeze out more, but if you let the fusion framework write the loops, you're tied to what has been done before - though it's not easy to beat a good framework). – Daniel Fischer Oct 04 '12 at 20:52
  • Where they can be used, `Int` and `Word` give the best performance, so use them where possible when speed matters. Strictness and laziness are important: where an argument needs to be evaluated anyway, make it strict; where you can start producing the result without it, keep it lazy. GHC doesn't do many low-level optimisations yet, so be aware of bit-twiddling tricks to substitute other operations (division, primarily). Avoid `div` and `mod` where `quot` and `rem` will do the right thing. Whether LLVM beats the native code generator depends, but it's rare that the NCG is significantly faster. – Daniel Fischer Oct 04 '12 at 20:52
  • Your answer made me happy initially, but now after a bit of time to digest: shouldn't the compiler be able to make that optimization for me? After all, I can see it heuristically almost *immediately*, and I'm sure these kinds of patterns occur all the time in Haskell. Is there something preventing such compiler optimizations from being implemented? – lobsterism Oct 09 '12 at 02:55
  • The big hindrance in getting more optimisations is manpower. There are only a handful of people working on GHC, and they have lots of other things to do. And it's not so easy to make the optimisations general. For the ``i `mod` 4``, you have the problem that you can only replace it with `i .&. 3` if `i` is non-negative, or if you know that you're on a two's complement machine, for `rem`, only if you know `i` is non-negative. Doable, of course, but not a ten-minute-job. Eliminating the list is also not as easy for a compiler (must be general) as for the programmer (specific to the situation). – Daniel Fischer Oct 09 '12 at 10:34
  • @tibbe I haven't yet learned that those are now available via `Data.Bits`, thanks. So what I had tried was ``lshift (I# n) (I# s) = I# (n `uncheckedIShiftL#` s)``. Didn't make any measurable difference in execution time, nor does `unsafeShiftL`. A bit surprising, since in the core, the comparison of the shift distance against 64 is done with plain `shiftL` (but not the comparison against 0, seems GHC knows `i .&. 3` is never negative). – Daniel Fischer Oct 17 '12 at 11:22
  • @DanielFischer I've verified that `shiftL` and `unsafeShiftL` produce different Core. The former has an extra branch that compares against the word size (32 in my case). It doesn't seem to matter though, likely due to the branch taken always ends up being the same and thus can be predicted reliably (and thus be free). It's still a good practice to use `unsafeShiftL` if you know that the shift amount is smaller than the word size. – tibbe Oct 18 '12 at 04:20

Few things beat a hand-written loop with a strict accumulator:

{-# LANGUAGE BangPatterns #-}

import Data.Bits

f :: Int -> Int
f n = g 0 0
  where g !i !s | i <= n    = g (i+1) (if i .&. (unsafeShiftL 1 (i `rem` 4)) /= 0 then s+1 else s)
                | otherwise = s

main = print $ f 123456789

In addition to the tricks mentioned so far, this also replaces shift with unsafeShiftL, which doesn't check its argument.

Compiled with -O2 and -fllvm, this is about 13x faster than the original on my machine.

Note: Testing if bit i of x is set can be written more clearly as x `testBit` i. This produces the same assembly as the above.

hammar

Vector instead of list, fold instead of filter-and-length

Substituting an unboxed vector for the list, and a fold (i.e. incrementing a counter) for the filter-and-length, improves the time significantly for me. Here's what I used:

import qualified Data.Vector.Unboxed as UV
import Data.Bits

foo :: Int
foo = UV.foldl (\s i -> if i .&. (shift 1 (i `rem` 4)) /= 0 then s+1 else s) 0 (UV.enumFromN 0 123456789)

main = print foo

The original code (with two changes though: rem instead of mod as suggested in the comments, and adding an Int to the signature to avoid Integer) gave:

$ time ./orig 
30864196

real    0m2.159s
user    0m2.144s
sys     0m0.008s

The modified code above gave:

$ time ./new 
30864196

real    0m1.450s
user    0m1.440s
sys     0m0.004s

LLVM

While LLVM didn't have a massive effect on the original code, it really did on the modified (I'd love to learn why...).

Original (LLVM):

$ time ./orig-llvm 
30864196

real    0m2.047s
user    0m2.036s
sys     0m0.008s

Modified (LLVM):

$ time ./new-llvm 
30864196

real    0m0.233s
user    0m0.228s
sys     0m0.004s

For comparison, OP's original C code comes in at 0m0.152s user on my system.

This is all GHC 7.4.1, GCC 4.6.3, and vector 0.9.1. LLVM is either 2.9 or 3.0; I have both but can't seem to figure out which one GHC is actually using.

gspr
  • I'll add as an aside that *only* replacing the filter-and-length with a fold (i.e. keeping lists and not using vectors) actually gives me slightly *worse* performance than the original code when *not* using LLVM: ca. 2.6s. With LLVM I get ca. 1.4s, which is about the same improvement as going to vectors without using LLVM. – gspr Oct 04 '12 at 17:03

Try this:

import Data.Bits
main = do
    print $ length $ filter (\i -> i .&. (shift 1 (i `rem` 4)) /= 0) [0..123456789::Int]

Without the `::Int`, the type defaults to `::Integer`. `rem` does the same as `mod` on positive values, and it is the same as `%` in C. `mod`, on the other hand, is mathematically correct on negative values, but it is slower.

  • `int` in C is 32 bits wide
  • `Int` in Haskell is either 32 or 64 bits wide, like `long` in C
  • `Integer` is an arbitrary-precision integer: it has no min/max values, and its memory size depends on its value (similar to a string).
comonad