3

For example, a function that receives a list of trades and returns a list of value sums, indexed by time:

trades = [{time:1,value:8}, {time:1.1,value:8}, ... {time:1.2,value:7}, {time:2.1,value:8} ...]
total_value_by_time = {}

for trade in trades:
    if not exists(total_value_by_time[trade.time]):
        total_value_by_time[trade.time] = 0
    total_value_by_time[trade.time] += trade.value

I couldn't figure out how to replicate this algorithm using any of the common FP approaches such as map and reduce. What is a pure functional way to do it?

MaiaVictor
  • Too short for an answer: use foldl. It's a pretty standard approach too – Guido Dec 01 '13 at 18:53
  • @Guido an answer could include more than that, though. I'm not certain how I could do that with foldl. Actually I kind of am! Oh. Be kind enough and answer me something: is it "ok" to use foldl to generate data larger than that you started with? As, for example, `(foldl (λ(accum x)(concat accum [x x])) [] arr)` – MaiaVictor Dec 01 '13 at 18:55
  • approach used is pretty common in javascript – charlietfl Dec 01 '13 at 19:00
  • @Viclib, I misunderstood the question, sorry. You _can_ use a map reduce approach, I'm writing it and I'll post it as an answer. – Guido Dec 01 '13 at 19:14

5 Answers

4

I'd say the most natural solution is to first group the list by equal times, then sum up each group's values. In Haskell,

tradesAccum = sortBy (compare`on`time)
          >>> groupBy ((==)`on`time)
          >>> map (map value >>> sum)

In case you try this and don't know where to find the necessary standard functions:

import Data.List (sortBy, groupBy)
import Data.Function (on)
import Control.Arrow ((>>>))
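
To see it run end-to-end, here's a self-contained sketch (the `Trade` record type is my assumption; the answer never defines it):

```haskell
import Data.List (sortBy, groupBy)
import Data.Function (on)
import Control.Arrow ((>>>))

data Trade = Trade { time :: Double, value :: Double }

-- sort by time, group equal times, then sum each group's values
tradesAccum :: [Trade] -> [Double]
tradesAccum = sortBy (compare `on` time)
          >>> groupBy ((==) `on` time)
          >>> map (map value >>> sum)
```

For the question's sample data, `tradesAccum [Trade 1 8, Trade 1.1 8, Trade 1.2 7, Trade 1 2]` yields `[10.0,8.0,7.0]`. Note that the time keys themselves are dropped; replace the last stage with `map (\g -> (time (head g), sum (map value g)))` if you want them paired back in.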

We can also make this nicely parallelisable and as efficient as with Map, but still use only lists. This is basically a variation of the above, but completely implemented as a prune-enabled, parallel merge sort:

import Control.Arrow ((***))
import Control.Parallel.Strategies

uniqueFstFoldSnd :: (Ord a, Semigroup b) => [(a, b)] -> [(a, b)]
uniqueFstFoldSnd [] = []
uniqueFstFoldSnd [x] = [x]
uniqueFstFoldSnd l = uncurry merge .
    (withStrategy $
        if len>100 then parTuple2 (evalList r0) (evalList r0)
                   else r0
    ) $ uniqueFstFoldSnd *** uniqueFstFoldSnd $ splitAt (len `quot` 2) l
  where merge [] ys = ys
        merge xs [] = xs
        merge ((k, u):xs) ((l, v):ys)
         | k < l      = (k, u   ) : merge        xs  ((l,v):ys)
         | k > l      = (l, v   ) : merge ((k,u):xs)        ys
         | otherwise  = (k, u<>v) : merge        xs         ys
        len = length l

Note that the parallelism doesn't yet give a significant performance improvement; I'm still experimenting with Strategies...

leftaroundabout
  • I assume it's not really parallelised and it boils down to be sequential due to laziness. Maybe some deep strictness can help here? – is7s Dec 02 '13 at 11:53
  • @is7s: I don't think so, `evalList r0` is quite the right strictness level. It does properly use all available threads, this just doesn't make it faster. I have [experienced such disappointing results before](http://stackoverflow.com/questions/19752983/is-it-possible-to-speed-up-a-quicksort-with-par-in-haskell) when parallelising otherwise un-optimised algorithms; I think the problem is something like that it's memory-bound, because we don't have cache locality at all. – leftaroundabout Dec 02 '13 at 12:02
  • But the strategies library doesn't mention anything about strictness, am I mistaken? Why not use `rdeepseq` instead of `r0`? I believe `monad-par` gives better strictness guarantees. – is7s Dec 02 '13 at 12:15
  • @is7s: sure it does mention striction, strategies is pretty much all about that! FWIW, this is from the `parallel` library. Haven't tried `monad-par` yet; it's supposed to be more generally applicable but also unwieldier, isn't it? – leftaroundabout Dec 02 '13 at 12:21
  • From the strategies library it says "`r0` performs *no* evaluation". I'm just afraid this might mean that the collecting thread will receive all thunks and evaluate everything sequentially. Why not `rdeepseq`? – is7s Dec 02 '13 at 12:28
  • Because it requires `NFData`, and is in itself much more expensive. `evalList r0` _does_ perform evaluation, namely of the whole list _spine_. – leftaroundabout Dec 02 '13 at 12:33
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/42329/discussion-between-leftaroundabout-and-is7s) – leftaroundabout Dec 02 '13 at 12:41
2

There is a function for this exposed as part of the Data.Map API. Your example boils down to fromListWith (+).
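
As a minimal sketch of what that looks like (representing each trade as a `(time, value)` tuple; the function name is mine):

```haskell
import qualified Data.Map as Map

-- duplicate keys are combined with (+), which is exactly the accumulation loop
totalValueByTime :: [(Double, Double)] -> Map.Map Double Double
totalValueByTime = Map.fromListWith (+)
```

`totalValueByTime [(1,8),(1.1,8),(1.2,7),(1,2)]` evaluates to `fromList [(1.0,10.0),(1.1,8.0),(1.2,7.0)]`.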

Daniel Wagner
  • This sounds promising, but could you add a little more on how fromListWith works for this case? – MaiaVictor Dec 01 '13 at 20:46
  • @Viclib The source for `fromListWith` is available from the "Source" hyperlink in the documentation I linked, and is pretty readable. It is essentially a fold that does insertion at each step. – Daniel Wagner Dec 01 '13 at 20:54
1

Well here's how I'd write your code in Haskell

import qualified Data.Map as M
import Data.List (foldl')

total :: [(Double, Integer)] -> M.Map Double Integer
total = foldl' step M.empty
  where step m (key, val) | M.member key m = M.adjust (+ val) key m
                          | otherwise      = M.insert key val m

In general, folds are the functional approach to iteration, and you use them to replace loops that accumulate things. In this specific case, you could also use group
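
A sketch of that group route (my own illustration, not part of the answer): sort by key, group runs of equal keys, then sum each group.

```haskell
import Data.List (sortOn, groupBy)
import Data.Function (on)

-- sort so equal keys are adjacent, group them, sum each group's values
totalByGroup :: [(Double, Integer)] -> [(Double, Integer)]
totalByGroup = map (\grp -> (fst (head grp), sum (map snd grp)))
             . groupBy ((==) `on` fst)
             . sortOn fst
```

`totalByGroup [(1,8),(1.1,8),(1.2,7),(1,2)]` gives `[(1.0,10),(1.1,8),(1.2,7)]`.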

daniel gratzer
1

You can think of this function as "breaking down" a list and then building back up a mapping or dictionary from the results. It makes for a relatively uninteresting map-reduce problem since everything is in the reduce.

import qualified Data.Map as Map
import           Data.Map (Map)

type Time = Double
type Value = Double
data Trade = Trade { time :: Time, value :: Value }

-- given some `mapReduce` function...
accum = mapReduce mapper reducer where
  mapper :: Trade -> Map Time Value
  mapper tr = Map.singleton (time tr) (value tr)

  -- This inherits the associativity of (+) so you can 
  -- reduce your mapper-generated `Map`s in any order. It's 
  -- not idempotent, though, so you must ensure that each datum
  -- is added to the reduction exactly once. This is typical
  -- for map reduce
  reducer :: [Map Time Value] -> Map Time Value
  reducer maps = Map.unionsWith (+) maps

-- without parallelization this looks like you'd expect
--     reducer . map mapper :: [Trade] -> Map Time Value

Where the interesting Map functions come from the Haskell containers package: Map.singleton and Map.unionsWith.
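
The answer leaves `mapReduce` abstract; a minimal sequential stand-in with the same shape might look like this (my sketch, not the answer's code):

```haskell
import qualified Data.Map as Map

-- sequential mapReduce: map every datum, then reduce all partial results at once
mapReduce :: (a -> b) -> ([b] -> c) -> [a] -> c
mapReduce mapper reducer = reducer . map mapper

accumTrades :: [(Double, Double)] -> Map.Map Double Double
accumTrades = mapReduce (uncurry Map.singleton) (Map.unionsWith (+))
```

A real map-reduce framework would run the mapper calls in parallel and combine partial `Map`s in any grouping, which is safe here because `(+)` is associative.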

Generally, "breaking down" and "reducing" algorithms are called "catamorphisms" (cata- is a Greek prefix meaning "downward", as in "catabolism"). Pure functional programs are absolutely amazing at catamorphisms, as they are usually "fold"s of some kind.

That said, we can write this same algorithm as a fold in just one pithy line. We'll use Data.Map.Strict and foldl' to make sure that this Haskell code doesn't generate any spare, useless thunks.

import qualified Data.Map.Strict as Map
import Data.List (foldl')

accum :: [Trade] -> Map Time Value
accum = foldl' (\oldMap tr -> Map.insertWith (+) (time tr) (value tr) oldMap) Map.empty
J. Abrahamson
  • To connect to @DanielWagner's answer, note that `Map.fromListWith f == Map.unionsWith f . map (uncurry Map.singleton)`. – J. Abrahamson Dec 01 '13 at 20:31
0

MapCollectReduce approach.

insert (a,b) [] = [(a,[b])]
insert (a,b) ((k, vs):rest) | a == k    = (k, b:vs):rest
                            | otherwise = (k, vs):(insert (a,b) rest)

collect ((k,v):kvs) = insert (k,v) (collect kvs)
collect [] = []

trades l = map (\(k, vs) -> (k, sum vs)) $ collect l

I coded a very primitive collect function that performs pretty horribly. Once you have that, you take your data and do a map (not needed in this case; consider it a map id). Then you collect the pairs, meaning you group the pairs by their key. Finally you compute on the collected data: you sum all the values for a given key.
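
Concretely, with the question's sample data (the answer's definitions repeated so the sketch runs standalone):

```haskell
-- add a value to its key's bucket, creating the bucket if absent
insert (a, b) [] = [(a, [b])]
insert (a, b) ((k, vs):rest)
  | a == k    = (k, b:vs) : rest
  | otherwise = (k, vs) : insert (a, b) rest

-- group all pairs into (key, [values]) buckets
collect []           = []
collect ((k, v):kvs) = insert (k, v) (collect kvs)

-- reduce each bucket by summing its values
trades l = map (\(k, vs) -> (k, sum vs)) (collect l)
```

`trades [(1,8),(1.1,8),(1.2,7),(1,2)]` gives `[(1.0,10),(1.2,7),(1.1,8)]` — the sums are right, though the key order depends on where `insert` appends new keys.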

@leftaroundabout's and @jozefg's answers probably outperform this one by a mile, but with a good mapCollectReduce library I believe this approach would be faster. (This also parallelizes well, but I don't think that's important to you.)

Guido
  • Instead of doing the linear traverse of your summarization association list, you can get asymptotic improvements by using a `Map` structure. You can do even better with a HashMap or IntMap if you can map times to those values. It'll also save a lot on the collect step which can be done on subsets of the result types still. Take a look at the map-reduce method I outlined in my answer. – J. Abrahamson Dec 01 '13 at 19:42