Fastest way to merge lists that have a common field?

Question

I am learning F# and I'm doing and odds comparison service (ala www.bestbetting.com) to pu theory into practice. So far I have the following structures of data:

type price = {    Bookie : string;    Odds : float32;    }

type selection = {
    Prices : list<price>;
    Name : string;
    }

type event = {    Name : string;    Hour : DateTime;    Sport : string;    Selections : list<selection>;    }

So, I have several of these "Events" coming from several sources. And I would need a really fast way of merging events with the same Name and Hour, and once that is done merge the prices of its different selections that have the same Name.

I've thought about getting the first list and then do a one-by-one search on the other lists and when the specified field matches return a new list containing both lists merged.

I'd like to know if there's a faster way of doing this as performance would be important. I have already seen this Merge multiple lists of data together by common ID in F# ... And although that was helpful, I am asking for the best performance-wise solution. Maybe using any other structure that it's not a list or another way of merging them... so any advice would be greatly appreciated.

Thanks!

I changed the blockquote to code so the syntax is highlighted. — Dan Abramov, Jun 29 '11 at 19:16
By the way, I had already seen this http://stackoverflow.com/questions/4787226/merge-multiple-lists-of-data-together-by-common-id-in-f before posting... Although that was helpful, I am asking for the best performance-wise solution. Thanks. — Jacobo Polavieja, Jun 29 '11 at 19:20
@Jacobo: I think you should edit the question to indicate that.. — Dan Abramov, Jun 29 '11 at 19:21
@Dan Abramov Many thanks! How did you do that? I tried but could not get it done so I just thought F# like highlighting was not supported... Thank you again. — Jacobo Polavieja, Jun 29 '11 at 19:22
@Jacobo: you can click `edit` and see the source yourself. Basically, there're two buttons there, one for blockquote (which prefixes each line with `>`) and another for code (which prefixes each line with four spaces). The correct language is determined by question tag, IIRC (complete rules are googlable, I believe Jeff Atwood wrote a reference somewhere on [meta](http://meta.stackoverflow.com)). — Dan Abramov, Jun 29 '11 at 19:28
Why is the solution you referenced unacceptable? How much better does the performance need to be? The ideal solution is going to involve a hashtable, which `groupBy` does for you. — Daniel, Jun 29 '11 at 20:24
@Daniel, would it be faster if the event/selections are flattened into (ev.Name, ev.Hour, ev.SelectionName, list) and then do a group by on all three? doesn't seem to change the meaning.. — Bala, Jun 30 '11 at 02:49

score 4 · Accepted Answer · answered Jun 29 '11 at 22:37

As Daniel mentioned in the comment, the key question is, how much better does the performance need to be compared to a solution based on standard Seq.groupBy function? If you have a lot of data to process, then it may be actually easier to use some database for this purpose.

If you only need something ~1.7 times faster (or possibly more, depending on the number of cores :-)), then you can try replacing Seq.groupBy with parallel version based on Parallel LINQ that is available in F# PowerPack. Using PSeq.groupBy (and other PSeq functions), you can write something like this:

#r "FSharp.PowerPack.Parallel.Seq.dll"
open Microsoft.FSharp.Collections

// Takes a collection of events and merges prices of events with the same name/hour
let mergeEvents (events:seq<event>) = 
  events 
  |> PSeq.groupBy (fun evt -> evt.Name, evt.Hour)
  |> PSeq.map (fun ((name, hour), events) ->
      // Merge prices of all events in the group with the same Selections.Name
      let selections = 
        events 
        |> PSeq.collect (fun evt -> evt.Selections)
        |> PSeq.groupBy (fun sel -> sel.Name)
        |> PSeq.map (fun (name, sels) ->
            { Name = name
              Prices = sels |> Seq.collect (fun s -> s.Prices) |> List.ofSeq } )
        |> PSeq.toList
      // Build new Event as the result - since we're grouping just using 
      // name & hour, I'm using the first available 'Sport' value 
      // (which may not make sense)
      { Name = name
        Hour = hour
        Sport = (Seq.head events).Sport
        Selections = selections })   
  |> PSeq.toList

I didn't test the performance of this version, but I believe it should be faster. You also don't need to reference the entire assembly - you can just copy source for the few relevant functions from PowerPack source code. Last time I checked, the performance was better when the functions were marked as inline, which is not the case in the current source code, so you may want to check that too.

Hello and sorry for not having been able to answer earlier; it's been very stressful days. I see I should have mentioned it... I wanted a faster way but also value code readability a lot, so your answer seems a perfect fit. As I'm a novice in F# I haven't still used the F# Power Pack Library although I see it referenced in lots of places. Many many thanks for taking the time to put sample code in as it's a very valuable insight in the process of learning. — Jacobo Polavieja, Jul 04 '11 at 22:23
By the way Tomas, although this is not the place, thanks for that great book about F#. Yours and Don Syme's are the ones I'm using as I find them equally great with a different style . It's being a great pleasure to follow your book and I just wished I could have more time dedicated to that task right now. Cheers. — Jacobo Polavieja, Jul 04 '11 at 22:29
@Jacobo - Thanks for the nice words about the book! I hope the answer will be useful, but (as always with parallel programming) you'll need to measure the performance to see if it actually helps (in your particular case). — Tomas Petricek, Jul 05 '11 at 12:44

score 1 · Answer 2 · answered Jun 29 '11 at 20:49

1

I haven't tested it, but I think this would work.

let events = List.init 10 (fun _ -> Unchecked.defaultof<event>) //TODO: initialize to something meaningful

for ((name, hour), evts) in (events |> Seq.groupBy (fun e -> e.Name, e.Hour)) do
  printfn "Name: %s, Hour: %A" name hour
  let prices = 
    seq {
      for e in evts do
        for s in e.Selections do
          for p in s.Prices do
            yield s.Name, p 
    }
    |> Seq.groupBy fst

  for (selectionName, p) in prices do
    printfn "  Selection Name: %s" selectionName
    for (_, price) in p do
      printfn "    %A" price

answered Jun 29 '11 at 20:49

Daniel

47,404
11
101
179

Hi daniel and thank you for taking the time to write the sampled code. As I am new to this language I may be missing something but, is there any reason why this should be faster than the parallel PSeq.GroupBy of the code above? I think I follow what the code does but see no reason why it would be faster. Anything I may not know about? I'll try it though and draw some comparisons just to suffice my curiosity. Thanks again for helping. – Jacobo Polavieja Jul 04 '11 at 22:31
@Jacobo: No, there's no reason why this should be faster than the parallel version, although there is a cost to parallelism which, in some cases, may negate the benefits. As Tomas said, the only way to know for certain is measure. – Daniel Jul 05 '11 at 14:07

Fastest way to merge lists that have a common field?

2 Answers2

Linked