
I have the following function that converts a list of counts into a discrete probability distribution:

freq2prob l = [ (curr / (sum l)) | curr <- l ]

Unfortunately, (sum l) is recomputed for each element of l, making the computational complexity unnecessarily high.

What is the most concise, elegant, "haskellic" way to deal with this?

Chris Stryczynski
fstab
  • It's simple: `freq2prob l = [ curr / s | let s = sum l, curr <- l ]` – Will Ness Nov 15 '14 at 19:36
  • thanks! But it's not clear why `let s = sum l` is there and not outside the list comprehension. `curr` value is replaced at every iteration. The assignment to `s` is in a place which leads to think it also belong to each iteration. – fstab Nov 15 '14 at 19:39
  • it's *before* the "iteration specification", so it's done only once. You can even write `[1 | 1<0]`, without any iterations inside. – Will Ness Nov 15 '14 at 19:58

2 Answers


It's simple:

freq2prob l = [ curr / s | let s = sum l, curr <- l ] 

You can put it outside the list comprehension as well (notice the in):

freq2prob l = let s = sum l in [ curr / s | curr <- l ]

This is effectively the same computation.

That is because the first is essentially translated into

freq2prob :: (Fractional a) => [a] -> [a]
freq2prob l = [ curr / s | let s = sum l, curr <- l ] 
 = do
     let s = sum l
     curr <- l
     return (curr / s)
 = let s=sum l in
   l >>= (\curr -> [curr / s])
   -- concatMap (\curr -> [curr / s]) l
   -- map (\curr -> curr / s) l

and the second, obviously, to the same code,

freq2prob l = let s = sum l in [ curr / s | curr <- l ]
 = let s = sum l in
   do
     curr <- l
     return (curr / s)
 = let s=sum l in
   l >>= (\curr -> [curr / s])
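
Incidentally, the placement of the let matters: written before the generator it is done once, but written after the generator a fresh binding (and a fresh thunk) is created for every element, so the sum is recomputed each time. A small sketch using Debug.Trace (the "sum!" label is purely for illustration) makes the difference visible when run unoptimised, e.g. in ghci:

```haskell
import Debug.Trace (trace)

-- let before the generator: one shared thunk, sum forced once
sharedSum :: [Double] -> [Double]
sharedSum l = [ curr / s | let s = trace "sum!" (sum l), curr <- l ]

-- let after the generator: a new thunk per element, sum forced each time
perElemSum :: [Double] -> [Double]
perElemSum l = [ curr / s | curr <- l, let s = trace "sum!" (sum l) ]
```

Evaluating sharedSum [1,2,3] prints "sum!" once; perElemSum [1,2,3] prints it three times, although both return the same list.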
Will Ness

We can use a let expression or a where clause for this:

freq2prob l = let s = sum l in 
              [ curr / s | curr <- l ]

or

freq2prob l = [ curr / s | curr <- l ] 
    where s = sum l

but it would be more idiomatic to use a higher-order function than a list comprehension, since you're doing the same thing to each element:

freq2prob l = map (/sum l) l

The sum l in the dividing function (/sum l) will only be evaluated once.

This is because when evaluating map f xs, the compiler doesn't make the elementary mistake of creating multiple copies of the function f to be evaluated separately: f is a single closure, and the sum l it captures is a thunk that every application points to, so it's evaluated at most once.
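
We can watch this sharing happen (a small sketch, with the trace label purely for illustration) using Debug.Trace, which prints its message each time the traced thunk is actually forced. Here the section (/sum l) is rewritten with an explicit where binding so the shared sum can be instrumented:

```haskell
import Debug.Trace (trace)

-- The same computation as map (/sum l) l, with the shared sum instrumented
freq2prob :: [Double] -> [Double]
freq2prob l = map (/ s) l
  where s = trace "sum forced" (sum l)
```

Evaluating freq2prob [1,2,3,4] in ghci prints "sum forced" just once, before the first element appears, confirming that every element's division points at the same thunk.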

As a simple and blunt test, we can investigate crude timing stats in ghci for whether it's noticeably faster to use the same function repeatedly or a slightly different function each time. First I'll check whether the results of sums are cached in ghci:

ghci> sum [2..10000000]
50000004999999
(8.31 secs, 1533723640 bytes)
ghci> sum [2..10000000]
50000004999999
(8.58 secs, 1816661888 bytes)

So you can see it wasn't cached, and that there's a little variance in these crude stats. Now let's multiply by the same complicated thing every time:

ghci> map (* sum [2..10000000]) [1..10]
[50000004999999,100000009999998,150000014999997,200000019999996,250000024999995,300000029999994,350000034999993,400000039999992,450000044999991,500000049999990]
(8.30 secs, 1534499200 bytes)

So, allowing for a little variance, it took almost exactly the same time to multiply ten numbers by sum [2..10000000] using map as it took to multiply a single one; multiplying ten pairs of numbers takes hardly any time at all. So ghci (an interpreter, not even an optimising compiler) didn't introduce multiple copies of the same calculation.

This isn't because ghci is clever; it's because lazy evaluation, a nice feature of pure functional programming, never does more work than necessary. In most programming languages it would be hard to optimise away passing a lengthy calculation around all over the place instead of saving its result in a variable.

Now let's compare that with doing a slightly different calculation each time, where we add up slightly fewer numbers as we go.

ghci> map (\x -> sum [x..10000000]) [1..10]
[50000005000000,50000004999999,50000004999997,50000004999994,50000004999990,50000004999985,50000004999979,50000004999972,50000004999964,50000004999955]
(77.98 secs, 16796207024 bytes)

Well, that took roughly ten times as long, as we expected, because now we're asking it to do a different calculation every time. I can also report that the output paused before each number, whereas when we didn't change the expensive-to-calculate number, it was only evaluated once: the pause came before the first number and the rest appeared rapidly.

AndrewC
  • could you clarify why `sum l` in `(/sum l)` is guaranteed to be cached? – Will Ness Nov 15 '14 at 19:55
  • @WillNess Because of fabulous lazy evaluation. The interpreter won't introduce multiple copies of the same constant expression `sum l`. I've edited the answer with a fuller explanation. – AndrewC Nov 15 '14 at 21:10
  • "The interpreter won't introduce multiple copies of the same constant expression `sum l`", but functions (like `(\x-> x / sum l)`) ["are not memoized by GHC"](http://stackoverflow.com/a/3951092/849891) ([see also](https://www.haskell.org/haskellwiki/Constant_applicative_form)). Yes, thanks. – Will Ness Nov 16 '14 at 12:13
  • @WillNess Yes. `x / sum l` is not a constant expression, which is why that doesn't work. I think Constant Applicative Form is exactly the criteria we're looking for. – AndrewC Nov 16 '14 at 14:58