1

I am looking for a way to identify similar records in the Oracle 11.2 table within a sliding 3 sec window. There are approximately 500K rows in the table inserted within 24 hours period.

Requirements:

  1. These records should satisfy a similarity score of at least 88% using UTL_MATCH.JARO_WINKLER_SIMILARITY
  2. The FLAG column should be updated for a given record if there is at least one more similar record exists within three seconds window

Table definition:

CREATE TABLE ADDR_TAB
  ( DT DATE NOT NULL,
    ADDR VARCHAR2(200) NOT NULL,
    FLAG INT
  );
CREATE INDEX ADDR_DATE_IDX ON ADDR_TAB(DT);

Sample data:

insert into addr_tab values (to_date('03-OCT-13 04.36.57 PM','DD-MON-RR HH.MI.SS AM'),'test',null);
insert into addr_tab values (to_date('03-OCT-13 04.36.57 PM','DD-MON-RR HH.MI.SS AM'),'test123',null);
insert into addr_tab values (to_date('03-OCT-13 04.36.58 PM','DD-MON-RR HH.MI.SS AM'),'2test2',null);
insert into addr_tab values (to_date('03-OCT-13 04.36.58 PM','DD-MON-RR HH.MI.SS AM'),'12test',null);
insert into addr_tab values (to_date('03-OCT-13 04.37.00 PM','DD-MON-RR HH.MI.SS AM'),'12test',null);
insert into addr_tab values (to_date('03-OCT-13 04.37.02 PM','DD-MON-RR HH.MI.SS AM'),'12test',null);
insert into addr_tab values (to_date('03-OCT-13 04.37.03 PM','DD-MON-RR HH.MI.SS AM'),'1test87',null);
insert into addr_tab values (to_date('03-OCT-13 04.37.03 PM','DD-MON-RR HH.MI.SS AM'),'12test',null);
insert into addr_tab values (to_date('03-OCT-13 04.37.03 PM','DD-MON-RR HH.MI.SS AM'),'12test56',null);
insert into addr_tab values (to_date('03-OCT-13 04.37.04 PM','DD-MON-RR HH.MI.SS AM'),'12test88',null);
insert into addr_tab values (to_date('03-OCT-13 04.37.05 PM','DD-MON-RR HH.MI.SS AM'),'12test56',null);

SQLFiddle: http://sqlfiddle.com/#!4/1b53f/1

Chandu
  • 81,493
  • 19
  • 133
  • 134
Henry
  • 11
  • 1
  • You described the requirements just fine - now it's time to show us what you tried and where did you get stuck – Nir Alfasi Oct 16 '13 at 00:49
  • To @alfasin: I could only think about looping through the records in multiple passes, but I believe there should be a better way to solve the issue by using the analytic functions from Oracle. – Henry Oct 16 '13 at 01:07
  • @Henry, can you show us the expected output. – Noel Oct 16 '13 at 03:46

1 Answers1

1

Try this:

WITH basedata 
     AS (SELECT * 
         FROM   addr_tab 
         ORDER  BY dt) 
SELECT * 
FROM   basedata A 
WHERE  EXISTS (SELECT 1 
               FROM   basedata B 
               WHERE  ( ( A.dt <= ( B.dt + 3 / 86400 ) ) 
                        AND ( A.dt >= ( B.dt - 3 / 86400 ) ) ) 
                      AND a.ROWID <> b.ROWID 
                      AND utl_match.Jaro_winkler(A.addr, B.addr) * 100 >= 88) 

SQLFiddle: http://sqlfiddle.com/#!4/1b53f/22

Chandu
  • 81,493
  • 19
  • 133
  • 134
  • Very interesting solution. Thanks! I have one more follow-up question. Is it possible to modify the query to identify any given number (count) of the similar records within 3 sec window? It is very useful if the threshold changes, which is exactly what happened to me. – Henry Oct 16 '13 at 14:26
  • You can definitely parameterize the query. Will update he post when I get a chance. – Chandu Oct 16 '13 at 18:43
  • Thanks again! I will be very interested to see the modified query that uses more generic approach based on a given threshold (number of similar records within 3 sec) – Henry Oct 17 '13 at 14:07
  • How are you executing this query? SQL Developer/IDE, SQLPlus .. etc? – Chandu Oct 18 '13 at 01:00
  • I am running it in SQL Developer – Henry Oct 18 '13 at 04:58
  • Take a look at this answer, I have already answered similar question. Will update the answer here shortly. http://stackoverflow.com/questions/5653423/how-do-i-use-variables-in-oracle-sql-developer/5653541#5653541 – Chandu Oct 18 '13 at 23:25
  • Hi @Chandu, could you please upload the updated query (generic version) if you already have one? Thanks! – Henry Oct 22 '13 at 22:18