
Consider the following use case:
I want to iterate through two db tables in parallel and find differences and gaps/missing records in either table. Assume that 1) the pk of each table is an int ID field; 2) the tables are read in ID order; 3) records may be missing from either table (with corresponding sequence gaps).

I'd like to do this in a single pass over each db - using lazy reads. (My initial version of this program uses sequence objects and the data reader - unfortunately it makes multiple passes over each db.)
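To clarify what I mean by lazy reads, the data reader side is roughly the following kind of sketch (simplified, not the actual program; it assumes the int ID is at ordinal 0 and that the query orders by ID):

let lazyIds (reader : System.Data.Common.DbDataReader) =
    // Simplified sketch: stream the IDs lazily, one row at a time,
    // so the table is only traversed once as the sequence is consumed.
    seq {
        while reader.Read () do
            yield reader.GetInt32 0
    }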

I've thought of using pairwise sequence processing and using Seq.skip within the iterations to try to keep the table processing in sync. However, apparently this is very slow, as Seq.skip has a high overhead (it creates new sequences under the hood), so it could be a problem with a large table (say 200k recs).

I imagine this is a common design pattern (compare concurrent data streams from different sources) and am interested in feedback/comments/links to similar projects.

Anyone care to comment?

BrendanC
  • Could you share a code example, so we can improve it? – Laurent Apr 13 '11 at 18:00
  • You realize the fastest way to do this is inside the database? Any speed saving you get with algorithmic improvements will be DWARFED by removing the need to transfer ALL the data over the network. – hova Apr 13 '11 at 20:03
  • @hova: The scenario I described was overly simplified and possibly misleading. The real world use case here consists of similar tables in different databases that need to be compared during a data migration project (e.g. converting from an Access db to Postgres or MySQL). You can also extend this scenario to handle data from various external (possibly remote) data sources (e.g. Excel files, csv files, log files, XML etc.). So the goal here is to solve the more general case. – BrendanC Apr 14 '11 at 05:54
  • Are you looking for gaps in id's, differences in column values, or both? – Robert Jeppesen Apr 14 '11 at 11:44
  • All of the above - imagine your job was to validate a data conversion project between a desktop db and Oracle or SQL Server, and you need to know about any and all discrepancies - the iteration/sequence processing is just the tip of the iceberg here. – BrendanC Apr 14 '11 at 17:59

2 Answers


When you use sequences, any lazy function adds some overhead to the sequence. Calling Seq.skip thousands of times on the same sequence will clearly be slow.

You can use Seq.zip or Seq.map2 to process two sequences at a time:

> Seq.map2 (+) [1..3] [10..12];;
val it : seq<int> = seq [11; 13; 15]

If the Seq module is not enough, you might need to write your own function. I'm not sure I understand exactly what you're trying to do, but this sample function might help you:

let fct (s1: seq<_>) (s2: seq<_>) =
    use e1 = s1.GetEnumerator()
    use e2 = s2.GetEnumerator()
    let rec walk () =

        // do some stuff with the element of both sequences
        printfn "%d %d" e1.Current e2.Current

        // cond1, cond2 and cond3 below are placeholders for your own
        // comparison logic (e.g. comparing the current IDs of both rows)
        if cond1 then // move in both sequences
            if e1.MoveNext() && e2.MoveNext() then walk ()
            else () // end of a sequence

        elif cond2 then // move to the next element of s1
            if e1.MoveNext() then walk()
            else () // end of s1

        elif cond3 then // move to the next element of s2
            if e2.MoveNext() then walk ()
            else () // end of s2

    // we need at least one element in each sequence
    if e1.MoveNext() && e2.MoveNext() then walk()

Edit:

The previous function was meant to extend the functionality of the Seq module, and you'll probably want to make it a higher-order function. As ildjarn said, using LazyList can lead to cleaner code:

let rec merge (l1: LazyList<_>) (l2: LazyList<_>) =
    match l1, l2 with
    | LazyList.Cons(h1, t1), LazyList.Cons(h2, t2) ->
        if h1 <= h2 then LazyList.cons h1 (merge t1 l2)
        else LazyList.cons h2 (merge l1 t2)
    | LazyList.Nil, l2 -> l2
    | _ -> l1

merge (LazyList.ofSeq [1; 4; 5; 7]) (LazyList.ofSeq [1; 2; 3; 6; 8; 9])

But I still think you should separate the iteration of your data from the processing. Writing a higher-order function to iterate is a good idea (in the end, it doesn't really matter that the iterator function uses mutable enumerators internally).
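For example, such a higher-order iterator could look roughly like this (an untested sketch; the cmp function and the three handlers are whatever your comparison needs):

// Rough sketch of a reusable higher-order iterator: cmp decides how the two
// current elements relate, and the three handlers say what to do when an
// element is in both streams, only in the left one, or only in the right one.
let compareSeqs cmp onBoth onLeftOnly onRightOnly (s1 : seq<_>) (s2 : seq<_>) =
    use e1 = s1.GetEnumerator()
    use e2 = s2.GetEnumerator()
    let rec walk more1 more2 =
        match more1, more2 with
        | true, true ->
            let c = cmp e1.Current e2.Current
            if c = 0 then
                onBoth e1.Current e2.Current
                walk (e1.MoveNext()) (e2.MoveNext())
            elif c < 0 then
                onLeftOnly e1.Current
                walk (e1.MoveNext()) more2
            else
                onRightOnly e2.Current
                walk more1 (e2.MoveNext())
        | true, false ->
            onLeftOnly e1.Current
            walk (e1.MoveNext()) false
        | false, true ->
            onRightOnly e2.Current
            walk false (e2.MoveNext())
        | false, false -> ()
    walk (e1.MoveNext()) (e2.MoveNext())

With something like this, finding the gaps is just a matter of passing compare and three callbacks that record the IDs seen only on one side, e.g. compareSeqs compare (fun _ _ -> ()) (printfn "only in A: %d") (printfn "only in B: %d") idsA idsB.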

Laurent
  • I don't think the map and zip fns will work here since my test sequences are of unknown (and possibly unequal) length and can also contain gaps. However your sample code looks like a good start - thx – BrendanC Apr 14 '11 at 06:02
  • The sample code would surely blow a StackOverflow with 200k records? – Robert Jeppesen Apr 14 '11 at 11:40
  • No, the compiler is able to optimize these recursive calls and compile the function as a simple loop (I've checked the generated IL). If you don't trust the compiler, feel free to use a while loop; the transformation is very easy to do here. – Laurent Apr 14 '11 at 11:51
  • IMO, working with enumerators directly forces an imperative style that nearly defeats the purpose of working in a functional language in the first place. Coercing the `seq`s into `LazyList`s at least allows for pattern matching and functional programming style. – ildjarn Apr 14 '11 at 13:44
  • @ildjarn: The whole Seq library is written in this way, so it cannot be a bad thing. I think such a function should be made generic (a higher-order function) and reusable, so the logic can be kept separate from the mutable enumerators. My answer was about defining a new primitive for iteration. You're right that LazyList could be an alternative. – Laurent Apr 14 '11 at 13:59
  • @Laurent : Right, and the `Seq` module exists so that no one else has to endure writing code in this style :-P I'm not downvoting the answer by any means, I just think this is a strange level of abstraction for F# code when easy alternatives exist :-] – ildjarn Apr 14 '11 at 14:00

Here's my (completely untested) take, doing a single pass over both tables:

let findDifferences readerA readerB =
    let idsA, idsB =
        let getIds (reader:System.Data.Common.DbDataReader) =
            reader |> LazyList.unfold (fun reader ->
                if reader.Read ()
                then Some (reader.GetInt32 0, reader)
                else None)
        getIds readerA, getIds readerB

    let onlyInA, onlyInB = ResizeArray<_>(), ResizeArray<_>()
    let rec impl a b =
        let inline handleOnlyInA idA as' = onlyInA.Add idA; impl as' b
        let inline handleOnlyInB idB bs' = onlyInB.Add idB; impl a bs'
        match a, b with
        | LazyList.Cons (idA, as'), LazyList.Cons (idB, bs') ->
                if   idA < idB then handleOnlyInA idA as'
                elif idA > idB then handleOnlyInB idB bs'
                else impl as' bs'
        | LazyList.Nil, LazyList.Nil  -> () // termination condition
        | LazyList.Cons (idA, as'), _ -> handleOnlyInA idA as'
        | _, LazyList.Cons (idB, bs') -> handleOnlyInB idB bs'
    impl idsA idsB
    onlyInA.ToArray (), onlyInB.ToArray ()

This takes two DataReaders (one for each table) and returns two int[]s which indicate the IDs that were only present in their respective table. The code assumes that the ID field is of type int and is at ordinal index 0.
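Usage would be along these lines (another untested sketch; the connection strings, table and column names are placeholders, and both queries must ORDER BY the ID column so the readers stream IDs in order):

open System.Data.SqlClient

// Hypothetical usage sketch -- connection strings, table and column names
// are placeholders; both queries order by the ID column.
let runComparison () =
    use connA = new SqlConnection ("<connection string for database A>")
    use connB = new SqlConnection ("<connection string for database B>")
    connA.Open ()
    connB.Open ()
    use cmdA = new SqlCommand ("SELECT Id FROM TableA ORDER BY Id", connA)
    use cmdB = new SqlCommand ("SELECT Id FROM TableB ORDER BY Id", connB)
    use readerA = cmdA.ExecuteReader ()
    use readerB = cmdB.ExecuteReader ()
    // explicit upcast to DbDataReader, the type findDifferences expects
    findDifferences (readerA :> System.Data.Common.DbDataReader)
                    (readerB :> System.Data.Common.DbDataReader)

let onlyInA, onlyInB = runComparison ()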

Also note that this code uses LazyList from the F# PowerPack, so you'll need to get that if you don't already have it. If you're targeting .NET 4.0 then I strongly recommend getting the .NET 4.0 binaries which I've built and hosted here, as the binaries from the F# PowerPack site only target .NET 2.0 and sometimes don't play nice with VS2010 SP1 (see this thread for more info: Problem with F# Powerpack. Method not found error).

ildjarn