3

I am looking to merge two Deedle (F#) frames based on a specific column in each frame in a similar manner as pandas.DataFrame.Merge.The perfect example of this would be a primary frame that contains columns of data and a (city, state) column along with an information frame that contains the following columns: (city, state); lat; long. If I want to add the lat long columns into my primary frame, I would merge the two frames on the (city, state) column.

Here is an example:

    let primaryFrame =
            [(0, "Job Name", box "Job 1")
             (0, "City, State", box "Reno, NV")
             (1, "Job Name", box "Job 2")
             (1, "City, State", box "Portland, OR")
             (2, "Job Name", box "Job 3")
             (2, "City, State", box "Portland, OR")
             (3, "Job Name", box "Job 4")
             (3, "City, State", box "Sacramento, CA")] |> Frame.ofValues

    let infoFrame =
            [(0, "City, State", box "Reno, NV")
             (0, "Lat", box "Reno_NV_Lat")
             (0, "Long", box "Reno_NV_Long")
             (1, "City, State", box "Portland, OR")
             (1, "Lat", box "Portland_OR_Lat")
             (1, "Long", box "Portland_OR_Long")] |> Frame.ofValues

    // see code for merge_on below.
    let mergedFrame = primaryFrame
                      |> merge_On infoFrame "City, State" null

Which would result in 'mergedFrame' looking like this:

> mergedFrame.Format();;
val it : string =
  "     Job Name City, State    Lat             Long             
0 -> Job 1    Reno, NV       Reno_NV_Lat     Reno_NV_Long     
1 -> Job 2    Portland, OR   Portland_OR_Lat Portland_OR_Long 
2 -> Job 3    Portland, OR   Portland_OR_Lat Portland_OR_Long 
3 -> Job 4    Sacramento, CA <missing>       <missing>   

I have come up with a way of doing this (the 'merge_on' function used in the example above), but being a Sales Engineer who is new to F#, I imagine there is a more idiomatic/efficient way of doing this. Below is my functions for doing this along with a 'removeDuplicateRows' which does what you would expect and was needed for the 'merge_on' function; if you want to comment on a better way of doing this as well, please do.

    let removeDuplicateRows column (frame : Frame<'a, 'b>) =
             let nonDupKeys = frame.GroupRowsBy(column).RowKeys
                              |> Seq.distinctBy (fun (a, b) -> a) 
                              |> Seq.map (fun (a, b) -> b)  
             frame.Rows.[nonDupKeys]


    let merge_On (infoFrame : Frame<'c, 'b>) mergeOnCol missingReplacement 
                  (primaryFrame : Frame<'a,'b>) =
          let frame = primaryFrame.Clone() 
          let infoFrame =  infoFrame                           
                           |> removeDuplicateRows mergeOnCol 
                           |> Frame.indexRows mergeOnCol
          let initialSeries = frame.GetColumn(mergeOnCol)
          let infoFrameRows = infoFrame.RowKeys
          for colKey in infoFrame.ColumnKeys do
              let newSeries =
                  [for v in initialSeries.ValuesAll do
                        if Seq.contains v infoFrameRows then  
                            let key = infoFrame.GetRow(v)
                            yield key.[colKey]
                        else
                            yield box missingReplacement ]
              frame.AddColumn(colKey, newSeries)
          frame

Thanks for your help!

UPDATE:

Switched Frame.indexRowsString to Frame.indexRows to handle cases where the types in the 'mergOnCol' are not strings.

Got rid of infoFrame.Clone() as suggested by Tomas

1 Answers1

1

The way Deedle does joining of frames (only in row/column keys) sadly means that it does not have a nice built-in function to do joining of frames over a non-key column.

As far as I can see, your approach looks very good to me. You do not need Clone on the infoFrame (because you are not mutating the frame) and I think you can replace infoFrame.GetRow with infoFrame.TryGetRow (and then you won't need to get the keys in advance), but other than that, your code looks fine!

I came up with an alternative and a bit shorter way of doing this, which looks as follows:

// Index the info frame by city/state, so that we can do lookup
let infoByCity = infoFrame |> Frame.indexRowsString "City, State"

// Create a new frame with the same row indices as 'primaryFrame' 
// containing the additional information from infoFrame.
let infoMatched = 
  primaryFrame.Rows
  |> Series.map (fun k row -> 
      // For every row, we get the "City, State" value of the row and then
      // find the corresponding row with additional information in infoFrame. Using 
      // 'ValueOrDefault' will automatically give missing when the key does not exist
      infoByCity.Rows.TryGet(row.GetAs<string>("City, State")).ValueOrDefault)
  // Now turn the series of rows into a frame
  |> Frame.ofRows

// Now we have two frames with matching keys, so we can join!
primaryFrame.Join(infoMatched)

This is a bit shorter and maybe more self-explanatory, but I have not done any tests to check which is faster. Unless performance is a primary concern, I think going with the more readable version is a good default choice though!

Tomas Petricek
  • 240,744
  • 19
  • 378
  • 553
  • Thanks for the help. I ran this on my actual project and had a few comments. This worked except I still needed to add a "removeDuplicateRows" in cases where (sticking to this example) the info frame might have the same "City, State" listed twice. Also, I could not get one of my infoFrames to merge with this function but it would with the one above. Also, – Darien Shannon May 05 '17 at 19:33
  • Lastly I ran some test. In (rows, cols), primary frame = (29635, 7), first info frame = (2572, 4), & second info frame = (18601, 2). 5 runs of merging all three averaged 5.97 sec. with my method and 27.05 sec. with yours. Nonetheless thanks for looking into this and answering so quickly! – Darien Shannon May 05 '17 at 19:47
  • Nice work and thanks for running the tests :-) Sounds like your way is the way to go! – Tomas Petricek May 05 '17 at 19:50
  • One last test... Saved my frames to csv, loaded them into pandas DataFrame's and ran pandas.merge... Pandas averaged 0.0156 seconds to merge the 3 frames. With that said I'll take the 7 seconds over the extra hours it would have taken me to get the analysis done (prior to the merge) correctly in python vs in F#. – Darien Shannon May 05 '17 at 20:40