
I need to delete duplicate records: the ID column has unique values, but the other columns contain duplicate values. First I need to find those records in the table, then delete them.

select *
from hourly_report_table
where api_date = TO_DATE('27-SEP-20','dd-MON-yy')
  and api_hour = 17;

ID  APPLICATION API_DATE    API_HOUR    SO  APP API ACTUAL_API  AVG_RUN TOTAL_TRANS GOOD_TRANS  FAIL_TRANS  FAIL_PERC   COUNTS_TO1  PERC_TO1    COUNTS_TO15 PERC_TO15   COUNTS_OVER15   PERC_OVER15 COUNTS_1TO5 PERC_1TO5   COUNTS_5TO10    PERC_5TO10  COUNTS_10TO15   PERC_10TO15 COUNTS_15TO30   PERC_15TO30 COUNTS_30TO60   PERC_30TO60 COUNTS_OVER60   PERC_OVER60 CREATED_USER_ID CREATED_TIME_STAMP  METRIC  AVG_RUN_GOOD    AVG_RUN_FAIL

106508413   LS  27-SEP-20   19  ATAPortReset    G2  GetCustomerSnapshot GetCustomerSnapshot 0.403   7   7   0   0   7   1   7   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   UFOSODRPT   30-SEP-20   S   0.403   0
105398782   LS  27-SEP-20   19  ATAPortReset    G2  GetCustomerSnapshot GetCustomerSnapshot 0.403   7   7   0   0   7   1   7   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   UFOSODRPT   29-SEP-20   S   0.403   0

Thanks. Is this query right for getting the 27 Sep 2020 records with API_HOUR = 17?

select *
from hourly_report_table t1
where exists
(
  select *
  from hourly_report_table t2
  where t2.id <> t1.id
    and t2.application = t1.application
    and t2.api_date = t1.api_date
    and t2.api_hour = t1.api_hour
    and t2.so = t1.so
    and t2.app = t1.app
    and t2.api = t1.api
    and t2.actual_api = t1.actual_api
    and t2.avg_run = t1.avg_run
    and t2.total_trans = t1.total_trans
    and t2.good_trans = t1.good_trans
    and t2.fail_trans = t1.fail_trans
    and t2.fail_perc = t1.fail_perc
    --and t2.counts_to1 = t1.counts_to1
    --and t2.perc_to1 = t1.perc_to1
    and t2.counts_to15 = t1.counts_to15
    and t2.perc_to15 = t1.perc_to15
    and t2.counts_5to10 = t1.counts_5to10
    and t2.perc_5to10 = t1.perc_5to10
    and t2.counts_10to15 = t1.counts_10to15
    and t2.perc_10to15 = t1.perc_10to15
    and t2.counts_15to30 = t1.counts_15to30
    and t2.perc_15to30 = t1.perc_15to30
    and t2.counts_30to60 = t1.counts_30to60
    and t2.perc_30to60 = t1.perc_30to60
    and t2.counts_over60 = t1.counts_over60
    and t2.perc_over60 = t1.perc_over60
    and t2.created_user_id = t1.created_user_id
    --and t2.created_time_stamp = t1.created_time_stamp
    and t2.metric = t1.metric
    and t2.avg_run_good = t1.avg_run_good
    and t2.avg_run_fail = t1.avg_run_fail
)
and t1.api_date = TO_DATE('27-SEP-20','dd-MON-yy')
and t1.api_hour = 17;
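An equivalent way to list the same duplicates in a single pass is an analytic count partitioned by the compared columns. This is a sketch only; the full column list is abbreviated and would have to match the comparisons above:

```sql
select *
from (
  select t.*,
         count(*) over (partition by application, api_date, api_hour, so, app, api,
                                     actual_api, avg_run /* ... remaining compared columns ... */) as dup_cnt
  from hourly_report_table t
  where api_date = DATE '2020-09-27'
    and api_hour = 17
)
where dup_cnt > 1;
```

This avoids the correlated self-join and uses the `DATE '2020-09-27'` literal suggested in the comments.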

So, is the query below fine?

To remove multiple duplicate entries for all hours between 9/27 17:00 and 9/30 13:00, I added the requested change.

select *
from hourly_report_table t1
where exists
(
  select *
  from hourly_report_table t2
  where t2.id <> t1.id
    and t2.application = t1.application
    and t2.api_date = t1.api_date
    and t2.api_hour = t1.api_hour
    and t2.so = t1.so
    and t2.app = t1.app
    and t2.api = t1.api
    and t2.actual_api = t1.actual_api
    and t2.avg_run = t1.avg_run
    and t2.total_trans = t1.total_trans
    and t2.good_trans = t1.good_trans
    and t2.fail_trans = t1.fail_trans
    and t2.fail_perc = t1.fail_perc
    and t2.counts_to1 = t1.counts_to1
    and t2.perc_to1 = t1.perc_to1
    and t2.counts_to15 = t1.counts_to15
    and t2.perc_to15 = t1.perc_to15
    and t2.counts_5to10 = t1.counts_5to10
    and t2.perc_5to10 = t1.perc_5to10
    and t2.counts_10to15 = t1.counts_10to15
    and t2.perc_10to15 = t1.perc_10to15
    and t2.counts_15to30 = t1.counts_15to30
    and t2.perc_15to30 = t1.perc_15to30
    and t2.counts_30to60 = t1.counts_30to60
    and t2.perc_30to60 = t1.perc_30to60
    and t2.counts_over60 = t1.counts_over60
    and t2.perc_over60 = t1.perc_over60
    and t2.created_user_id = t1.created_user_id
    --and t2.created_time_stamp = t1.created_time_stamp
    and t2.metric = t1.metric
    and t2.avg_run_good = t1.avg_run_good
    and t2.avg_run_fail = t1.avg_run_fail
)
and api_date + interval '1' hour * api_hour
    between timestamp '2020-09-27 17:00:00' and timestamp '2020-09-30 13:00:00';

I used the query below and it deleted all the previous data, and it's committed now. Can you please suggest what to do now?

DELETE FROM HOURLY_REPORT_TABLE
WHERE ROWID NOT IN
(
  SELECT MIN(ROWID)
  FROM HOURLY_REPORT_TABLE
  WHERE api_date + interval '1' hour * api_hour
        between timestamp '2020-09-27 17:00:00' and timestamp '2020-09-30 13:00:00'
  --WHERE API_DATE = TO_DATE('28-SEP-20','dd-MON-yy') and API_HOUR = 17
  GROUP BY ID, APPLICATION, API_DATE, API_HOUR, SO, APP, API, ACTUAL_API, AVG_RUN,
           TOTAL_TRANS, GOOD_TRANS, FAIL_TRANS, FAIL_PERC, COUNTS_TO1, PERC_TO1,
           COUNTS_TO15, PERC_TO15, COUNTS_OVER15, PERC_OVER15, COUNTS_1TO5, PERC_1TO5,
           COUNTS_5TO10, PERC_5TO10, COUNTS_10TO15, PERC_10TO15,
           COUNTS_15TO30, PERC_15TO30, COUNTS_30TO60, PERC_30TO60,
           COUNTS_OVER60, PERC_OVER60, CREATED_USER_ID, CREATED_TIME_STAMP,
           METRIC, AVG_RUN_GOOD, AVG_RUN_FAIL
);
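Since the delete is already committed, one standard option for inspecting and recovering the lost rows is Oracle's flashback features. This is only a sketch: it assumes undo data for the deletion window is still available (it expires with undo retention), and the timestamp shown is a hypothetical pre-delete point in time:

```sql
-- Check whether the pre-delete data is still visible (hypothetical timestamp):
SELECT COUNT(*)
FROM hourly_report_table
AS OF TIMESTAMP TO_TIMESTAMP('2020-10-02 09:00:00', 'YYYY-MM-DD HH24:MI:SS');

-- If it is, the table can be rewound (requires row movement on the table):
ALTER TABLE hourly_report_table ENABLE ROW MOVEMENT;
FLASHBACK TABLE hourly_report_table
  TO TIMESTAMP TO_TIMESTAMP('2020-10-02 09:00:00', 'YYYY-MM-DD HH24:MI:SS');
```

If the undo data has already aged out, the remaining option is a restore from backup via the DBA.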
singh_dba
  • Does this answer your question? [Removing duplicate rows from table in Oracle](https://stackoverflow.com/questions/529098/removing-duplicate-rows-from-table-in-oracle) – Ludovic Oct 02 '20 at 06:47
    On a side note: When looking for dates, you can use date literals, e.g. `WHERE api_date = DATE '2020-09-27'`. – Thorsten Kettner Oct 02 '20 at 06:52

1 Answer


The rows are identical except for their ID and creation timestamp. In order to find duplicates, you must compare all other columns:

The query finds both rows by looking for a duplicate with another ID (t2.id <> t1.id):

select *
from hourly_report_table t1
where exists
(
  select *
  from hourly_report_table t2
  where t2.id <> t1.id
  and t2.application = t1.application
  and t2.api_date = t1.api_date
  and t2.api_hour = t1.api_hour
  and ...
);

The delete statement keeps only one row of each group of duplicates by comparing t2.id < t1.id:

delete
from hourly_report_table t1
where exists
(
  select *
  from hourly_report_table t2
  where t2.id < t1.id
  and t2.application = t1.application
  and t2.api_date = t1.api_date
  and t2.api_hour = t1.api_hour
  and ...
);
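The same keep-one-row delete can also be written with ROW_NUMBER, which avoids the correlated self-join. Again a sketch: the partition clause must list all the compared columns, abbreviated here:

```sql
delete from hourly_report_table
where rowid in
(
  select rid
  from (
    select rowid as rid,
           row_number() over (partition by application, api_date, api_hour, so, app, api
                                           /* ... remaining compared columns ... */
                              order by id) as rn
    from hourly_report_table
  )
  where rn > 1
);
```

Ordering by id keeps the row with the lowest ID in each group, matching the t2.id < t1.id variant above.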

If you want to restrict this to a particular date and hour, do so.

where exists (...) and api_date = date '2020-09-27' and api_hour = 17

Thus you are only dealing with part of the table, but you must make sure that the DBMS can find this data quickly (and does not have to read the whole table again and again). Provide an index for this:

create index idx1 on hourly_report_table (api_date, api_hour);
Thorsten Kettner
  • Thanks Thorsten. The ID column has only unique values and the other columns have duplicate records... there are millions of records, how can we delete them faster? – singh_dba Oct 02 '20 at 06:55
  • Need to include these parameters also in the query: where API_DATE = TO_DATE('27-SEP-20','dd-MON-yy') and API_HOUR = 17; – singh_dba Oct 02 '20 at 07:20
  • SQL speed is a lot about providing appropriate indexes. I've updated my answer accordingly. – Thorsten Kettner Oct 02 '20 at 08:02
  • Thanks Thorsten, if I need to delete between 9/27 17:00 and 9/30 13:00, how do I use that in the above query? – singh_dba Oct 02 '20 at 09:06
  • This is a tad more complicated than it looks like on first glance. This doesn't work: `where api_date between date '2020-09-27' and date '2020-09-30' and api_hour between 17 and 13`, because we must relate the hours to the dates. This works: `where api_date + interval '1' hour * api_hour between timestamp '2020-09-27 17:00:00' and timestamp '2020-09-30 13:00:00'`... – Thorsten Kettner Oct 02 '20 at 09:19
  • ... The only disadvantage now is that we are no longer looking up particular dates and hours (`where api_date = ... and api_hour = ...`), but an expression (`api_date + api_hours` so to speak). Thus the index can't be used, because the index works on separate dates and hours. There are two ways to go about this. Either add a redundant limitation on the date (`and api_date between date '2020-09-27' and date '2020-09-30'`) or provide an additional index on the expression (`create index idx2 on hourly_report_table (api_date + interval '1' hour * api_hour)`). – Thorsten Kettner Oct 02 '20 at 09:19
  • Edited the change into the question. Please suggest: can we select first and then delete using the above query? Thanks Thorsten – singh_dba Oct 02 '20 at 09:29
  • I don't understand what you are asking. What is it you think you cannot do by yourself now? You know how to select the duplicates. You know how the delete statement differs from the select statement. So, why don't you just do it? – Thorsten Kettner Oct 02 '20 at 09:53
  • Thanks Thorsten, waiting for the delete script to complete... it's running very long... any suggestions? – singh_dba Oct 02 '20 at 09:57
  • No. I told you to provide the index on api_date and api_hour. I told you to keep `api_date between date '2020-09-27' and date '2020-09-30'` in your statement. There is nothing more you can do, I guess. With these measures taken, this shouldn't take too long, though. Just wait for the statement to finish. – Thorsten Kettner Oct 02 '20 at 10:03
  • Thanks Thorsten, count is: 3203786 – singh_dba Oct 02 '20 at 13:09
  • Unique values are present in the ID column, so do we have to apply an index on the ID column? – singh_dba Oct 02 '20 at 15:52
  • Again, I have no idea what you are trying to tell me. Yes, the ID is probably the table's primary key and the DBMS will probably use an index to ensure its uniqueness. So what? – Thorsten Kettner Oct 02 '20 at 16:00
  • Thanks Thorsten. So do we need to create these indexes for a faster delete: create index idx2 on hourly_report_table (api_date + interval '1' hour * api_hour); create index idx1 on hourly_report_table (api_date, api_hour); – singh_dba Oct 02 '20 at 16:43
  • Yes, these indexes are meant to speed up the select and delete statements. The index idx2 is only needed for the delete, if you don't have `api_date between ...` in your statement. – Thorsten Kettner Oct 02 '20 at 16:46
  • Thanks Thorsten, duplicate records are still there. – singh_dba Oct 03 '20 at 06:13
  • This is strange. You've written the delete statement and I am sure you will agree that there *can* be no duplicates after running it. If you are still seeing duplicates although the delete statement has run without an error, this may be caused by only seemingly same data, e.g. a float number being slightly different where your tool only shows rounded numbers, or a timestamp where your tool only shows the date part. – Thorsten Kettner Oct 03 '20 at 10:07