3

I'm observing the following execution times:

  • ~9s when compiled with stack build
  • ~5s when compiled with stack build --profile

I'd expect non-profiled execution time to be below 1s, which tells me that the profiled time above makes sense, it's the non-profiled time that is abnormally slow.

A few more details:

  • The program reads a relational algebra-like DSL, applies a series of rule-based transformations, and outputs a translation to SQL. Parsing is done with megaparsec. I/O is String based and is relatively small (~ 150 KBs). I would exclude I/O as a source of the problem. Transformations involve recursive rewriting rules over an ADT. In a few occasions, I use ugly-memo to speed up such recursive rewrites.
  • Using stack 2.9.1 with LTS 18.28, ghc 8.10.7
    (EDIT: upgrading to LTS 20.11, ghc 9.2.5, does not help)
  • In the cabal file:
  ghc-options:        -O2 -Wall -fno-warn-unused-do-bind -fno-warn-missing-signatures -fno-warn-name-shadowing -fno-warn-orphans
  ghc-prof-options:   -O2 -fprof-auto "-with-rtsopts=-p -V0.0001 -hc -s"
  • Notice that none of the above is new, but I have never observed this behaviour before.
  • I've seen this related question, but compiling with -fno-state-hack does not help
  • Compiling with -O1 doesn't help (about the same as -O2), and -O0 is significantly slower, as expected.
  • The profiling information shows no culprit. The problem only shows up with non-profiled execution.

I realise I'm not giving many details. In particular, I'm not including any code snippet because it's a large code base and I have no idea of which part of it could trigger this behaviour. My point is indeed that I don't know how to narrow it down.

So my question is not "where is the problem?", but rather: what could I do to get closer to the source of the issue, given that the obvious tool (profiling) turns out to be useless in this case?

cornuz
  • 2,678
  • 18
  • 35
  • 1
    That's strange indeed, but while I accept that you can't post your entire code here, you should still make it _somehow_ reproducible or at least provide information on what kind of executable this is you're building, specifically what kind of workload it's doing. – leftaroundabout Feb 20 '23 at 09:13
  • @leftaroundabout true. I added a few more details. – cornuz Feb 20 '23 at 09:31
  • Try compiling with `-fno-full-laziness`. It's the "optimization" most likely to misfire in a severely negative manner. – Carl Feb 20 '23 at 11:51
  • @Carl thanks for your suggestion, it doesn't seem to be it. Compiling with `-fno-full-laziness` makes the non-profile execution 4x slower. – cornuz Feb 20 '23 at 12:00
  • 1
    "What could I do to get closer to the source of the issue?" -- Perhaps you could manually specify cost centers for your profiling, then progressively delete some until you get the bad behavior. Presumably whichever one(s) you deleted last would then be (partially) responsible. – Daniel Wagner Feb 20 '23 at 14:32

2 Answers2

3

Though I still don't know the exact reason for the difference between profile and non-profile execution, I now found the culprit in my code.

A function that recurse over the ADT was implemented as follows:

f :: (?env :: Env) => X -> Y
f = memo go
where
  go :: (?env :: Env) => X -> Y
  go (...) = ...
  go x = f (children x) -- memoised recursion over x's children

Note that:

  • each recursion is memoised, as it the data structure may contain many duplicates
  • the function uses an Env parameter, which is used internally but never changes during the whole recursion. Note that it is declared as an implicit parameter.

Profiling info about this function showed many more function calls than expected. Three orders of magnitude more than expected! (Still pretty weird that, despite this, profile execution was reasonably fast).

Turns out, the implicit parameter was the issue. I don't know the exact reason, but I had read blogs about how evil implicit parameters can be and now I have first-hand proof of that.

Changing this function to using only explicit parameters fixed it completely:

f :: Env -> X -> Y
f env = memo (go env)
where
  go :: Env -> X -> Y
  go env (...) = ...
  go env x = f env (children x) -- memoised recursion over x's children

EDIT: Thanks to @leftaroundabout for pointing out that my solution above isn't very correct - it under-uses memoisation (see discussion below). The correct version should be:

f :: Env -> X -> Y
f env = f_env
where 
  f_env = memo $ go env
  go env (...)
  go env x = f_env (children x)

What the implicit parameter seemed to cause is that uglymemo failed to recognise previously stored calls, so I got all the overhead that comes with it and no benefit.

My problem is kind of solved, but if someone can explain the underlying reasons for which implicit parameters can cause this behaviour, and why would profile execution be faster, I'd still be very interested to learn that.

cornuz
  • 2,678
  • 18
  • 35
  • Interesting, apparently the implicit argument is flipped with the explicit one, so that you end up memoising only a bunch of functions instead of actual values of these functions. And enabling profiling somehow prevents this flipping. — I'd be curious whether this also happens with other memoisation libraries, or just with ugly-memo. (Which, yes, is ugly... very much and rather needlessly.) – leftaroundabout Feb 20 '23 at 13:47
  • 1
    ...no, wait... isn't the new version actually wrong? This _shouldn't_ memoise, because `f env` is a function. What you should actually write is `f env = f_env where {f_env = memo $ go env; go ...; go env x = f_env (children x)`. – leftaroundabout Feb 20 '23 at 13:50
  • But `env` is constant, so memoising `f env` or `f_env` is the same, right? – cornuz Feb 20 '23 at 14:12
  • No, memoising `f env` will not ensure that another mention of `f env` uses the memoised values, even though `env` is constant across both sites. – leftaroundabout Feb 20 '23 at 14:20
  • hmm interesting, I still can't grasp why. Memo takes a function `f::(a -> b)` and returns a new function `f'::(a -> b)`. `(f env)` is still a function `(a -> b)` and memo will return something that you could think of as `(f env)'::(a -> b)`. Disabling memo gives me a 10x slowdown, so it is in fact memoising. – cornuz Feb 20 '23 at 14:51
  • Sure, the problem is that the first call to `f env` would generate one lookup table and then the next call to `f env` generates another lookup table and so on. The whole thing would still be memoised, but the recursion levels wouldn't share the lookup table. How good or bad that is depends on the exact call pattern, but at any rate I don't suppose that's what you intended. – leftaroundabout Feb 20 '23 at 14:56
  • OK, we're probably getting closer, please bear with me :) I don't get why, if `env` is always the same, a second invocation would generate a new lookup table. – cornuz Feb 20 '23 at 14:59
  • OUCH. I saw it. Each `memo` call generates a new lookup table, no matter what the function is! I need to define a new function (via `memo`) only once, and then reuse it through the recursion. That is what you mean, right? – cornuz Feb 20 '23 at 15:07
  • Because `f env` is not bound to a name. Only named, monomorphic entities are guaranteed to share storage. Now, sure – denotationally speaking, a compiler could decide to automatically bind `f env` to a shared storage. But GHC generally doesn't do this, for a bunch of reasons including **1.** in the general case, it could be much more difficult to determine when to do it than in this example, and if heuristics are applied performance would become very unpredictable **2.** it is not always desirable to share, sometimes it would just mean you allocate a lot of memory that isn't reclaimed. – leftaroundabout Feb 20 '23 at 15:09
  • I see. Thanks a lot for taking the time to explain that. – cornuz Feb 20 '23 at 15:17
  • I looked again at my code, it turns out that I followed exactly the pattern suggested by @leftaroundabout (probably taken by some correct example). The code I have written above was a (wrong) attempt to "simplify" it, by removing an extra definition that in my eyes was superfluous. – cornuz Feb 20 '23 at 15:24
  • @cornuz I wonder what would happen if you reintroduced the implicit parameter in the final version. I'd try adding `?env` only to the top-level signature, so that `f_env` is still memoized. That might lead to the expected performance without changing the type. – chi Feb 20 '23 at 18:36
0

This sounds like a contradiction: "So my question is not "where is the problem?", but rather: what could I do to get closer to the source of the issue"

Run it under a debugger, and just manually pause it during those 9s or 5s. Do this several times. The stack will show you exactly how it's wasting time, whether I/O or not. That's this technique.

What have you got to lose?

Mike Dunlavey
  • 40,059
  • 14
  • 91
  • 135