I am looking to merge two Deedle (F#) frames based on a specific column in each frame in a similar manner as pandas.DataFrame.Merge.The perfect example of this would be a primary frame that contains columns of data and a (city, state) column along with an information frame that contains the following columns: (city, state); lat; long. If I want to add the lat long columns into my primary frame, I would merge the two frames on the (city, state) column.
Here is an example:
let primaryFrame =
[(0, "Job Name", box "Job 1")
(0, "City, State", box "Reno, NV")
(1, "Job Name", box "Job 2")
(1, "City, State", box "Portland, OR")
(2, "Job Name", box "Job 3")
(2, "City, State", box "Portland, OR")
(3, "Job Name", box "Job 4")
(3, "City, State", box "Sacramento, CA")] |> Frame.ofValues
let infoFrame =
[(0, "City, State", box "Reno, NV")
(0, "Lat", box "Reno_NV_Lat")
(0, "Long", box "Reno_NV_Long")
(1, "City, State", box "Portland, OR")
(1, "Lat", box "Portland_OR_Lat")
(1, "Long", box "Portland_OR_Long")] |> Frame.ofValues
// see code for merge_on below.
let mergedFrame = primaryFrame
|> merge_On infoFrame "City, State" null
Which would result in 'mergedFrame' looking like this:
> mergedFrame.Format();;
val it : string =
" Job Name City, State Lat Long
0 -> Job 1 Reno, NV Reno_NV_Lat Reno_NV_Long
1 -> Job 2 Portland, OR Portland_OR_Lat Portland_OR_Long
2 -> Job 3 Portland, OR Portland_OR_Lat Portland_OR_Long
3 -> Job 4 Sacramento, CA <missing> <missing>
I have come up with a way of doing this (the 'merge_on' function used in the example above), but being a Sales Engineer who is new to F#, I imagine there is a more idiomatic/efficient way of doing this. Below is my functions for doing this along with a 'removeDuplicateRows' which does what you would expect and was needed for the 'merge_on' function; if you want to comment on a better way of doing this as well, please do.
let removeDuplicateRows column (frame : Frame<'a, 'b>) =
let nonDupKeys = frame.GroupRowsBy(column).RowKeys
|> Seq.distinctBy (fun (a, b) -> a)
|> Seq.map (fun (a, b) -> b)
frame.Rows.[nonDupKeys]
let merge_On (infoFrame : Frame<'c, 'b>) mergeOnCol missingReplacement
(primaryFrame : Frame<'a,'b>) =
let frame = primaryFrame.Clone()
let infoFrame = infoFrame
|> removeDuplicateRows mergeOnCol
|> Frame.indexRows mergeOnCol
let initialSeries = frame.GetColumn(mergeOnCol)
let infoFrameRows = infoFrame.RowKeys
for colKey in infoFrame.ColumnKeys do
let newSeries =
[for v in initialSeries.ValuesAll do
if Seq.contains v infoFrameRows then
let key = infoFrame.GetRow(v)
yield key.[colKey]
else
yield box missingReplacement ]
frame.AddColumn(colKey, newSeries)
frame
Thanks for your help!
UPDATE:
Switched Frame.indexRowsString to Frame.indexRows to handle cases where the types in the 'mergOnCol' are not strings.
Got rid of infoFrame.Clone() as suggested by Tomas