select records for range comparison

Question

I am stuck on this one. I wish I could do it in pure SQL, but at this point any solution will do.

I have two tables, ta and tb, containing lists of events that occurred at approximately the same time. The goal is to find "orphan" records: rows in ta with no corresponding row in tb. E.g.:

create table ta ( dt date, id varchar(1));
insert into ta values( to_date('20130101 13:01:01', 'yyyymmdd hh24:mi:ss') , '1' );
insert into ta values( to_date('20130101 13:01:02', 'yyyymmdd hh24:mi:ss') , '2' );
insert into ta values( to_date('20130101 13:01:03', 'yyyymmdd hh24:mi:ss') , '3' );


create table tb ( dt date, id varchar(1));
insert into tb values( to_date('20130101 13:01:05', 'yyyymmdd hh24:mi:ss') , 'a' );
insert into tb values( to_date('20130101 13:01:06', 'yyyymmdd hh24:mi:ss') , 'b' );

Let's say I must use a threshold of +-5 seconds, so the query to find candidate matches would look something like:

  select
    ta.id ida,
    tb.id idb
  from
    ta, tb
  where 
    tb.dt between (ta.dt - 5/86400) and (ta.dt + 5/86400)
  order by 1,2

(fiddle: http://sqlfiddle.com/#!4/b58f7c/5)

The rules are:

Events are mapped 1 to 1
The closest event on tb for a given one in ta will be considered the correct mapping.

That said, the resulting query should return something like

IDA | IDB
1   | a
2   | b
3   | null  <-- orphan event

However, the sample query above shows exactly the issue I am having: when the time windows overlap, it is difficult to systematically choose the correct row.

dense_rank() seems to be the answer to select the correct rows, but what partitioning/sorting will place them right?
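
To illustrate the problem, here is a sketch of a naive ranking attempt (the aliases dif and rn are made up for this example, and row_number() stands in for dense_rank() so that ties cannot produce two rows). It partitions by ta.id and orders by time distance. On the sample data every ta row picks 'a', because 'a' is the nearest tb event to all three, so the 1 to 1 rule is broken:

```sql
-- Naive attempt: keep the closest tb row for each ta row.
-- This does NOT enforce the 1 to 1 mapping; several ta rows can pick the same tb row.
select ida, idb, dif
  from (
    select ta.id ida,
           tb.id idb,
           round(abs(tb.dt - ta.dt) * 86400) dif,  -- distance in seconds
           row_number() over (partition by ta.id
                              order by abs(tb.dt - ta.dt), tb.id) rn
      from ta
      join tb
        on tb.dt between (ta.dt - 5/86400) and (ta.dt + 5/86400)
  )
 where rn = 1
 order by ida;
```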

Worth mentioning: I am doing this on Oracle 11gR2.

This sounds very difficult, and I think there are some requirements that need to be clarified. For example, why did you match `1` to `a`, when `3` and `a` match closer? (Do you want to consume the records in the order of ta.dt?) Also, what happens if there are ties? For example, what if there were two 'b' rows? Would one row match 2 and another match 3, or would they both match 2? — Jon Heller, Jun 16 '13 at 04:38
according to your definition it looks like the orphan should be 3 — haki, Jun 16 '13 at 11:05
@jonearles you are right, it might take some clarifying. The main rule here though is *events are mapped 1 to 1*. That would in fact mean that events are "consumed" once matched. I didn't want to mention that because it seems to imply an iterative process, which might make the complexity too high. In case of ties, either record will do. Ideally they would be taken in chronological order, but it doesn't really matter as long as the 1 to 1 mapping is respected. Did I answer your question? — filippo, Jun 16 '13 at 15:36
@haki That's what I meant to show. `ta.id = 3` has no corresponding record in `tb`. — filippo, Jun 16 '13 at 15:39

score 2 · Accepted Answer · answered Jun 16 '13 at 17:31

It seems like this should be possible with a single SQL statement using Oracle's analytic functions, perhaps with some combination of row_number(), lag(), and max() over. But I simply couldn't wrap my head around it. I kept on wanting to embed one analytic function within another, and I don't think you can do that. You can go in steps using Common Table Expressions, but I couldn't figure out how to make it work.

But a procedural solution is fairly straightforward using PL/SQL along with an extra table to store your result. I use row_number() to assign a chronological rank to each row in each of your source tables. You want a deterministic result, so it's important to have a tie breaker in case you have duplicate date-times, hence my order by of dt, id. Here is a SQL-Fiddle demo.

Or look at the code below:

create table result ( 
  dif number, 
  ida varchar(1),
  idb varchar(1),
  dta date,
  dtb date
);

declare
  prevA integer := 0;
  prevB integer := 0;
begin
  for rec in (
    with 
    ordered_ta as (
      select dt dta,
             id ida,
             row_number() over (order by dt, id) rowNumA
        from ta
    ),
    ordered_tb as (
      select dt dtb,
             id idb, 
             row_number() over (order by dt, id) rowNumB 
        from tb
    )
    select ta.*,
           tb.*,
           abs(dta - dtb) * 86400 dif
      from ordered_ta ta
      join ordered_tb tb
        on dtb between (dta - 5/86400) and (dta + 5/86400)
     order by rowNumA, rowNumB
  )
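  loop
    -- Take a pair only if both rows come after the last matched pair,
    -- so each row of ta and tb is used at most once.
    if rec.rowNumA > prevA and rec.rowNumB > prevB then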
      prevA := rec.rowNumA;
      prevB := rec.rowNumB;
      insert into result values (
        rec.dif,
        rec.ida,
        rec.idb,
        rec.dta,
        rec.dtb
      );
    end if;
  end loop;
end;
/

-- Matched pairs first, then orphans from ta, then orphans from tb.
select * from result
union all
select null dif, id ida, null idb, dt dta, null dtb
  from ta
 where id not in (select ida from result)
union all
select null dif, null ida, id idb, null dta, dt dtb
  from tb
 where id not in (select idb from result)
;
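
On the sample data from the question, the loop above keeps the pairs (1, a) and (2, b) and leaves 3 as an orphan. As a rough sanity check (a sketch, not part of the original solution), the following queries should return no rows if every ta and tb row was matched at most once:

```sql
-- Any row returned here means an id was matched more than once,
-- i.e. the 1 to 1 mapping was violated.
select ida, count(*) from result group by ida having count(*) > 1;
select idb, count(*) from result group by idb having count(*) > 1;
```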

Hey, thanks for your answer. Looking at it, the complexity seems quite high. I tried it with a few million records and it ran slowly. I am attempting to add indexes to get some more performance, but still... — filippo, Jun 17 '13 at 17:49
