SQL Capture duplicate records across two DIFFERENT columns

Question

I am writing an Exception Catching Page using MySQL for catching duplicate billing entries the following scenario.

Items details are entered in a table which has the following two columns (among others).

ItemCode VARCHAR(50), BillEntryDate DATE

It often happens that same item's bill is entered multiple times, but over a period of few days. Like,

"Football","2019-01-02"
"Basketball","2019-01-02"
...
...
"Football","2019-01-05"
"Rugby","2019-01-05"
...
"Handball","2019-01-05"
"Rugby","2019-01-07"
"Rugby","2019-01-10"

In the above example, the item Football is billed twice - first on 2Jan and again on 5Jan. Similarly, item Rugby is billed thrice on 5,7,10Jan.

I am looking to write simple SQL which can pickup each item [say, using distinct(ItemCode) clause], and then display all the records which are duplicates over a period of 30 days. In the above case, the expected output should be the following 5 records:

"Football","2019-01-02"
"Football","2019-01-05"
"Rugby","2019-01-05"
"Rugby","2019-01-07"
"Rugby","2019-01-10"

I am trying to run the following SQL:

select * from tablen a, tablen b, where a.ItemCode=b.ItemCode and a.BillEntryDate = b.BillEntryDate+30;

However, this seems to be highly inefficient as it is running for long without displaying any records. Is there any possibility for getting a less complex and faster method?

I did explore existing topics (like How do I find duplicates across multiple columns?), but it is catching duplicates where BOTH columns have same value. My requirement is one column same value, and second column varying over a month-long date range.

Gordon Linoff · Accepted Answer · 2019-02-03T14:44:08.680

1

You can use:

select t.*
from tablen t
where exists (select 1
              from tablen t2
              where t2.ItemCode = t.ItemCode and
                    t2.BillEntryDate <> t.BillEntryDate and
                    t2.BillEntryDate >= t1.BillEntryDate - interval 30 day and                    t2.BillEntryDate <= t1.BillEntryDate + interval 30 day 
             );

This will pick up both duplicates in the pair.

For performance, you want an index on (ItemCode, BillEntryDate).

edited Feb 03 '19 at 14:44

answered Feb 03 '19 at 13:55

Gordon Linoff

1,242,037
58
646
786

Thanks @GordonLinoff. However, this one is not working. It is simply returning all the records as is, and is taking 76.5 seconds to complete the query execution on a table having around 17K records.The server admins wont be happy. :-) Thanks for the index tip, I already have an index on those two columns. – Aquaholic Feb 03 '19 at 14:13
@user6337701 . . . Of course, it needs to ignore the "same" row. I just added the inequality. – Gordon Linoff Feb 03 '19 at 14:44
This one now works, Thanks! Though it still takes a very long time to run (exceeding 3 minutes) which makes it complex to use on a webpage. I'll see if execution can be improved. +1 and marking it as an answer. I have been using the following NON-sql method which is much faster: Generate all records ordered by ItemCode,BillEntryDate. Take that dump in excel. Put formula to get this row's BillEntryDate - previous row's BillEntryDate. Now filter this column for values in the range -30 to +30. Though NON-sql, this is quicker so I may stick to this for the once-in-amonth activity. – Aquaholic Feb 03 '19 at 15:05
@user6337701 . . . For best performance, you want a single composite index as described in the answer. – Gordon Linoff Feb 03 '19 at 15:33

forpas · Answer 2 · 2019-02-03T15:52:27.000

0

With EXISTS:

select ItemCode, BillEntryDate
from tablename t
where exists (
  select 1 from tablename 
  where 
    ItemCode = t.ItemCode  
    and 
    abs(datediff(BillEntryDate, t.BillEntryDate)) between 1 and 30
)

edited Feb 03 '19 at 15:52

answered Feb 03 '19 at 13:55

forpas

160,666
10
38
76

Thanks @forpas. As far as I know, now() returns current system date and time of server. The query is not limited to recent one month period - it should be generic. Anyways, I ran the query you provided (with now() function) and another modified version by replacing now() with t.BillEntryDate at both the instances in the SQL. In both cases, it is taking more than 3 minutes on a table with 17K records, and the output displayed is not correct. – Aquaholic Feb 03 '19 at 14:56
About the 3 mins I can't tell. But as for the time period maybe I misunderstood. Can you explain the generic part? – forpas Feb 03 '19 at 15:07
generic means it should not be within today's +30 or -30 days. It seems that use of now() is limiting it to that date range. It should run across all periods. – Aquaholic Feb 03 '19 at 15:11
So you want the duplicate dates to be less than 30 days apart? – forpas Feb 03 '19 at 15:14

SQL Capture duplicate records across two DIFFERENT columns

2 Answers2