
I'm trying to understand a performance anomaly I'm observing when running a program under runhaskell.

The program in question is:

isFactor n = (0 ==) . (mod n)
factors x = filter (isFactor x) [2..x]
main = putStrLn $ show $ sum $ factors 10000000

When I run this, it takes 1.18 seconds.

However, if I redefine isFactor as:

isFactor n f = (0 ==) (mod n f)

then the program takes 17.7 seconds.

That's a huge difference in performance and I would expect the programs to be equivalent. Does anybody know what I'm missing here?

Note: This doesn't happen when compiled under GHC.

Steve
  • My guess is that as runhaskell performs only a few optimizations, the second one suffers from certain strictness issues. – fuz Feb 17 '12 at 08:52

2 Answers


Although the functions should be identical, there's a difference in how they're applied. With the first definition, isFactor is fully applied at the call site isFactor x. In the second definition, it isn't, because now isFactor explicitly takes two arguments.

Even minimal optimizations are enough for GHC to see through this and create identical code for both definitions. However, if you compile with -O0 -ddump-simpl, you can see that without optimizations this makes a difference (at least with ghc-7.2.1; YMMV with other versions).

With the first isFactor, GHC creates a single function that is passed as the predicate to GHC.List.filter, with the calls to mod 10000000 and (==) inlined. With the second definition, most of the calls within isFactor are instead let-bound references to type-class methods, and these are not shared between invocations of isFactor. So there's a lot of dictionary overhead that's completely unnecessary.
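One way to sidestep that dictionary overhead, even under runhaskell, is to pin the types down. This is my own sketch, not something from the original dump: with a concrete signature the Num/Eq instances are known statically, so the interpreter doesn't have to thread polymorphic dictionaries through every call.

```haskell
-- Sketch (not from the original post): giving isFactor a monomorphic
-- Int signature means `mod` and `(==)` refer to the known Int
-- instances, rather than being looked up through Integral/Eq
-- dictionaries on each invocation.
isFactor :: Int -> Int -> Bool
isFactor n f = mod n f == 0

factors :: Int -> [Int]
factors x = filter (isFactor x) [2 .. x]

main :: IO ()
main = print (sum (factors 10000000))
```

Whether this fully recovers the fast version's timing under the interpreter is something you'd want to measure, but it removes the polymorphism that the dictionary overhead comes from.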

This is almost never an issue because even the default compiler settings will optimize it away; runhaskell, however, apparently doesn't do even that much. Even so, I have occasionally structured code as someFun x y = \z -> ... because I knew someFun would be partially applied, and this was the only way to preserve sharing between calls (i.e. GHC's optimizer wasn't sufficiently clever).
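That pattern looks roughly like this (my own sketch; `expensive` is a hypothetical placeholder for whatever work depends only on the first arguments):

```haskell
-- Hypothetical sketch of the someFun pattern described above: do the
-- work that depends only on x and y once, bind it with let, and
-- return a lambda for z. Every application of the partially applied
-- function then reuses `shared` instead of recomputing it.
someFun :: Int -> Int -> Int -> Int
someFun x y = let shared = expensive x y  -- computed once per (x, y)
              in \z -> shared + z         -- reused for every z

-- Placeholder standing in for some costly computation.
expensive :: Int -> Int -> Int
expensive x y = x * y
```

With `g = someFun 2 10`, each call `g z` reuses the same `shared` value rather than recomputing `expensive 2 10`.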

John L

As I understand it, runhaskell does little to no optimisation. It's designed to quickly load and run code. If it did more optimisation, it would take longer for your code to start running. Of course, in this case, the code runs faster with optimisations.

As I understand it, if a compiled version of the code exists, then runhaskell will use it. So if performance matters to you, just make sure you compile with optimisations turned on first. (I think you might even be able to pass switches to runhaskell to turn on optimisations - you'd have to check the documentation...)
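For example (assuming the source is saved as Factors.hs; the filename is just a placeholder):

```shell
# Compile with optimisations and run the resulting binary,
# instead of interpreting the source with runhaskell.
ghc -O2 Factors.hs -o factors
./factors
```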

MathematicalOrchid
  • Thanks for the answer! I understand that runhaskell doesn't perform many optimizations, but in my head I think `foo x = bar x` and `foo = bar` should generate the exact same code. That's where my confusion stems from. Even without optimizations, I'm just trying to understand why those two would ever perform differently. – Steve Feb 17 '12 at 09:38
  • I would imagine there's some tiny difference in exactly how the thunks are structured depending on which way you define the function. Something like one version shares one thunk with all invocations, while the other makes a copy for each invocation, leading to much more allocation and deallocation occurring. That would be my _guess_ anyway; you'd have to ask the GHC devs for the "real" reason. Obviously, with optimisations turned on, both ways should end up producing identical code. It's just that without the optimisation pass, that doesn't happen. – MathematicalOrchid Feb 17 '12 at 09:45
  • All optimizations are turned off in both ghci and runhaskell AFAIK, since the interpreter doesn't support unboxed values. – fuz Feb 17 '12 at 10:32
  • Looks like a reasonable guess. When compiled (to object code), both versions produce the same code regardless of optimisation level, so it's a quirk of the byte-code generator. – Daniel Fischer Feb 17 '12 at 11:24
  • @DanielFischer - did you try -O0? I see a difference there (ghc-7.2.1) – John L Feb 17 '12 at 12:29
  • @JohnL admittedly, I tried only with 7.4.1, got identical core there. – Daniel Fischer Feb 17 '12 at 12:31
  • @DanielFischer so 7.4.1 is objectively better, that's good. I feel like I know your system almost as well as my own at this point. – John L Feb 17 '12 at 12:50
  • @MathematicalOrchid you're correct. I was surprised to see that even the literal `0` isn't shared between invocations with the slow version. – John L Feb 17 '12 at 12:51