Remove Resultset Duplicates Based On Three Columns

Question

I have a complex oracle view which returns data which has logical duplicates within the rows returned. My goal is to retrieve only one row when duplicates are found based two columns (text and datetime), but to decide which one of the duplicates to return will be based on a third column (datetime).

I have simulated the result set below into a table with stubbed data (as found here on SQLFiddle):

CREATE TABLE TimeTable (
  ID number NOT NULL,
  NAME VARCHAR2(20) NOT NULL,      -- Grouped by this first
  TARGETVALUE INT,                 -- ultimate target value to be returned (no precedence from this value)
  NOTE VARCHAR2(20) NULL,          -- Just a note for the developer on StackOverflow
  BEGIN_DATE TIMESTAMP NOT NULL,   -- Grouped by this 2nd (down to the minute, not seconds) 
  APPROVAL_DATE TIMESTAMP NOT NULL -- Decides the ties for duplicates

 );

 insert into TimeTable (ID, NAME, TARGETVALUE, NOTE, BEGIN_DATE, APPROVAL_DATE) values 
(1, 'Alpha', 5,  'Duplicate First', '08-MAR-14 09.43.00.000000000', 
                                    '09-MAR-14 09.43.00.000000000');

 insert into TimeTable (ID, NAME, TARGETVALUE, NOTE, BEGIN_DATE, APPROVAL_DATE) values 
(2, 'Alpha', 2,  'Duplicate Middle', '08-MAR-14 09.43.00.000000000', 
                                     '09-MAR-14 09.43.00.000000000');


 insert into TimeTable (ID, NAME, TARGETVALUE, NOTE, BEGIN_DATE, APPROVAL_DATE) values 
(3, 'Alpha', 3, 'Final Target', '08-MAR-14 09.43.00.000000000', 
                                '09-MAR-14 10.00.00.000000000');

-- Same time as alpha, but not related.
 insert into TimeTable (ID, NAME, TARGETVALUE, NOTE, BEGIN_DATE, APPROVAL_DATE) values 
(4, 'Beta', 4, 'Only Target', '08-MAR-14 09.43.30.000000000', 
                              '09-MAR-14 11.00.30.000000000');

The result set which is needed would be 2 rows of

3, 'Alpha', 3, '08-MAR-14 09.43.00.000000000', '09-MAR-14 10.00.00.000000000'
4, 'Beta', 4, '08-MAR-14 09.43.30.000000000' '09-MAR-14 11.00.30.000000000'

Note for clarification if I had this value in the database

5, 'Alpha', 8, '09-MAR-14 09.43.00.000000000', '12-MAR-14 10.00.00.000000000'

Then that Alpha set would be unique and also returned because it is not considered a duplicate due to the different BEGIN_DATE (which is the 9th of March not 8th).

Here are the rules to follow

The NAME relates data.
The BEGIN_DATE is the second relation where exact times up to the minute will have duplicates which need to be weeded out to be based on #3.
If there are duplicates per #1 and #2, then they will be removed as determined by the latest APPROVAL_DATE which will win over the earlier dates.

Unrelated, but: you should not rely on implicit data type conversion. Using a character literal such as `'08-MAR-14 09.43.00.000000000'` will not work reliably if used from a client computer that has different NLS settings. Use a proper ANSI timestamp literal or use the `to_timestamp()` function using month numbers, not names. — , Oct 05 '14 at 11:42
@a_horse_with_no_name (*Hey we are both 70's trivia questions*) Thanks for the tip, Oracle is not my forte but I am learning. I needed a quick example for SO and timestamp was simply a convenience; but your advice is well taken. thx. — ΩmegaMan, Oct 06 '14 at 14:09

Lalit Kumar B · Accepted Answer · 2014-10-06T05:52:18.953

It should be a simple implementation of ANALYTICS to aggregate the data based on the rules mentioned.

You want the MAX of APPROVAL DATE among each group of NAME, BEGIN_DATE. So, all you need to do is :

MAX(APPROVAL_DATE) OVER(PARTITION BY NAME, BEGIN_DATE ORDER BY APPROVAL_DATE DESC) max_appr_dt

And, in your outer query, just filter out the DUPLICATES using WHERE APPROVAL_DATE = max_aapr_dt in the PREDICATE.

Note From PERFORMANCE point of view, this approach will do a TABLE SCAN only once. Thus, much better than the other approach of joining the table and having multiple table scans

Update Adding complete test cases as requested in comments

There are two ways using analytic :

1.MAX

SQL> SELECT *
  2  FROM
  3    (SELECT A.*,
  4      MAX(APPROVAL_DATE) OVER(PARTITION BY NAME, BEGIN_DATE ORDER BY APPROVAL_DATE DESC) max_appr_dt
  5    FROM TIMETABLE A
  6    )
  7  WHERE approval_date = max_appr_dt
  8  /

        ID NAME                 TARGETVALUE NOTE                 BEGIN_DATE                     APPROVAL_DATE                  MAX_APPR_DT
---------- -------------------- ----------- -------------------- ------------------------------ ------------------------------ ------------------------------
         3 Alpha                          3 Final Target         08-MAR-14 09.43.00.000000 AM   09-MAR-14 10.00.00.000000 AM   09-MAR-14 10.00.00.000000 AM
         4 Beta                           4 Only Target          08-MAR-14 09.43.30.000000 AM   09-MAR-14 11.00.30.000000 AM   09-MAR-14 11.00.30.000000 AM

2.ROW_NUMBER()

SQL> SELECT *
  2  FROM
  3    (SELECT a.*,
  4      row_number() OVER(PARTITION BY NAME, BEGIN_DATE ORDER BY APPROVAL_DATE DESC) AS "RNK"
  5    FROM TIMETABLE A
  6    )
  7  WHERE rnk =1
  8  /

        ID NAME                 TARGETVALUE NOTE                 BEGIN_DATE                     APPROVAL_DATE                         RNK
---------- -------------------- ----------- -------------------- ------------------------------ ------------------------------ ----------
         3 Alpha                          3 Final Target         08-MAR-14 09.43.00.000000 AM   09-MAR-14 10.00.00.000000 AM            1
         4 Beta                           4 Only Target          08-MAR-14 09.43.30.000000 AM   09-MAR-14 11.00.30.000000 AM            1

Execution plans for both the queries :

SQL> EXPLAIN PLAN FOR
  2  SELECT *
  3  FROM
  4    (SELECT A.*,
  5      MAX(APPROVAL_DATE) OVER(PARTITION BY NAME, BEGIN_DATE ORDER BY APPROVAL_DATE DESC) max_appr_dt
  6    FROM TIMETABLE A
  7    )
  8  WHERE approval_date = max_appr_dt
  9  /

Explained.

SQL>
SQL> select * from table(dbms_xplan.display)
  2  /

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 2691156688

---------------------------------------------------------------------------------
| Id  | Operation           | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |           |     4 |   356 |     3   (0)| 00:00:01 |
|*  1 |  VIEW               |           |     4 |   356 |     3   (0)| 00:00:01 |
|   2 |   WINDOW SORT       |           |     4 |   304 |     3   (0)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| TIMETABLE |     4 |   304 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("APPROVAL_DATE"="MAX_APPR_DT")

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

19 rows selected.

SQL>
SQL> EXPLAIN PLAN FOR
  2  SELECT *
  3  FROM
  4    (SELECT a.*,
  5      row_number() OVER(PARTITION BY NAME, BEGIN_DATE ORDER BY APPROVAL_DATE DESC) AS "RNK"
  6    FROM TIMETABLE A
  7    )
  8  WHERE rnk =1
  9  /

Explained.

SQL>
SQL> select * from table(dbms_xplan.display)
  2  /

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3768566268

--------------------------------------------------------------------------------------
| Id  | Operation                | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |           |     4 |   356 |     3   (0)| 00:00:01 |
|*  1 |  VIEW                    |           |     4 |   356 |     3   (0)| 00:00:01 |
|*  2 |   WINDOW SORT PUSHED RANK|           |     4 |   304 |     3   (0)| 00:00:01 |
|   3 |    TABLE ACCESS FULL     | TIMETABLE |     4 |   304 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("RNK"=1)
   2 - filter(ROW_NUMBER() OVER ( PARTITION BY "NAME","BEGIN_DATE" ORDER BY
              INTERNAL_FUNCTION("APPROVAL_DATE") DESC )<=1)

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

21 rows selected.

If you don't mind, can you please add the full query and also explain how this will scan the table only once ? Thanks. — Erran Morad, Oct 05 '14 at 16:31
Sure. But for that I need to create two test cases and show the execution plan. I wish if OP had provided the create and insert statements. I will try to update more information regarding perfoemance when time permits. — Lalit Kumar B, Oct 05 '14 at 16:36
There are create and insert statements in the question if thats what you wanted. Thanks. — Erran Morad, Oct 05 '14 at 16:38
Yes, I saw that. But insert statements are useless, since I have to modify the timestamp values to mention `TO_TIMESTAMP` explicitly to avoid implicit conversion which is quite concerning. Give me sometime to elaborate and post the test results. — Lalit Kumar B, Oct 05 '14 at 16:46
@LalitKumarB I provided the Oracle database sandbox in [sqlfiddle](http://sqlfiddle.com/#!4/91db4) with create table and inserts which can be used...what needs to be done *by me* past what is provided in SQLFiddle? Thanks! — ΩmegaMan, Oct 05 '14 at 17:57
@BoratSagdiyev, I just added an update with the queries and execution plans. — Lalit Kumar B, Oct 06 '14 at 05:53
@LalitKumarB Wow, I have learned something new and that is way cool. I will apply it on the *actual* SQL query on the client's db and get back to marking an answer or not. Thanks! — ΩmegaMan, Oct 06 '14 at 14:04
And, you should really take care of implicit data conversions from next time. In Oracle, anything enclosed within single quotes is considered a string literal. In your case, the `timestamp` datatype is implicitly converted. — Lalit Kumar B, Oct 06 '14 at 14:17
@LalitKumarB Timestamp was simply a convenience to getting a similar example onto SO. I am used to SQL Server's `Cast` operation for DateTime and didn't take time to learn the Oracle way of loading up a DateTime. *My Bad*. But your point is well taken and next time I will not use it. — ΩmegaMan, Oct 06 '14 at 14:31
Ah, ok. I am not verse with Sql server ;-) But in Oracle, you should explicitly use `TO_TIMESTAMP` to convert the literal into a valid timestamp value. — Lalit Kumar B, Oct 06 '14 at 14:32

score 1 · Answer 2 · answered Oct 05 '14 at 01:55

1

I know that you are using Oracle DB. But, I tested this using SQL server. The SQL should work for all DBs. Try my query though. I am not sure if this is the most efficient way to do it. Let me know if this helps.

select t.ID, t.name, t.targetvalue, t.begin_date, t.approval_date
from
(
select name, begin_date, max(approval_date) as approval_date
from timetable
group by name, begin_date
) as mx
inner join timetable as t
on mx.name = t.name and
mx.begin_date = t.begin_date and
mx.approval_date = t.approval_date

Extra query - if you want to create the table in the question inside SQL server -

CREATE TABLE TimeTable (
  ID int NOT NULL,
  NAME VARCHAR(20) NOT NULL,      
  TARGETVALUE INT,                
  NOTE VARCHAR(20) NULL,          
  BEGIN_DATE datetime NOT NULL,  
  APPROVAL_DATE datetime NOT NULL 

 );

 insert into TimeTable (ID, NAME, TARGETVALUE, NOTE, BEGIN_DATE, APPROVAL_DATE) values 
(1, 'Alpha', 5,  'Duplicate First', '08-03-14 09:43:00', 
                                    '09-03-14 09:43:00');

 insert into TimeTable (ID, NAME, TARGETVALUE, NOTE, BEGIN_DATE, APPROVAL_DATE) values 
(2, 'Alpha', 2,  'Duplicate Middle', '08-03-14 09:43:00', 
                                     '09-03-14 09:43:00');


 insert into TimeTable (ID, NAME, TARGETVALUE, NOTE, BEGIN_DATE, APPROVAL_DATE) values 
(3, 'Alpha', 3, 'Final Target', '08-03-14 09:43:00', 
                                '09-03-14 10:00:00');

-- Same time as alpha, but not related:
 insert into TimeTable (ID, NAME, TARGETVALUE, NOTE, BEGIN_DATE, APPROVAL_DATE) values 
(4, 'Beta', 4, 'Only Target', '08-03-14 09:43:30', 
                              '09-03-14 11:00:30');

answered Oct 05 '14 at 01:55

Erran Morad

4,563
10
43
72

It's an overkill. It will do the table scan twice, using `ANALYTICS` you could avoid that. See my answer. – Lalit Kumar B Oct 05 '14 at 08:53
@LalitKumarB - Thanks. I had seen these functions in SQL server but not in Oracle. Good to know. I'll let my answer remain to show how NOT to write such a query. – Erran Morad Oct 05 '14 at 16:25
I will try to imclude performance tests in my answer. But that would be not exactly related to OP's current requirement. It would be an add on to the answers. I appreciate your answer for SQL Server. Good. – Lalit Kumar B Oct 05 '14 at 16:39
1

Haha. Nice. There is a borat movie. Its mostly funny, but sometimes with lame jokes too. Check it out. Btw, it would be great if you could give us the full ORA query. Thanks. – Erran Morad Oct 05 '14 at 16:42
@BoratSagdiyev I appreciate your answer and frankly my stomping ground is SQL Server where temporary tables roam free unlike the prison Oracle provides (IMHO). But my example is basic to allow for an answer from SO, but the reality of the `View` I mentioned is a query of intense sql unions and at first look your answer is spot on ...if I could put the data into a temp table and do what you suggested. But alas I am not allowed to create a temp table on my clients database (*don't ask* lol) and your suggestion would require me to requiry the db; but your answer does provide insights. – ΩmegaMan Oct 05 '14 at 18:03
@BoratSagdiyev BTW, in SQL Server one can combine multiple inserts into the database by simply stringing the inserts as comma delimited after a single declaration. That is one thing I couldn't do in Oracle. :-) – ΩmegaMan Oct 06 '14 at 22:47
@Omegaman, you can do something like this in Oracle http://www.techonthenet.com/oracle/questions/insert_rows.php – Lalit Kumar B Oct 07 '14 at 02:54
@LalitKumarB THX for the link, but Oracle requires one to always specify the columns before each insert while SQL Server requires it only once for a type of batch inserts. See the example here [Inserting multiple rows in a single SQL query?](http://stackoverflow.com/questions/452859/inserting-multiple-rows-in-a-single-sql-query). If you can find an `Oracle` way of doing that, do let me know. :-) – ΩmegaMan Oct 07 '14 at 16:17
@OmegaMan, I understand what you say. But, Oracle and SQL Server are two different database products. And they have their own rules and functionalities. Not all being ANSI standards though. And, I am an Oracle database consultant, and have almost zero knowledge about SQL Server :-( – Lalit Kumar B Oct 07 '14 at 16:22
@LalitKumarB My original response to @BoratSagdiyev was that in `SQL Server` one can do the bulk loading whereas in `Oracle` one can't. Your reply to that suggested that Oracle could; sadly it can't. There is a confusion on topics here; thanks for the reply. – ΩmegaMan Oct 07 '14 at 18:25

Remove Resultset Duplicates Based On Three Columns

2 Answers2