10

I am trying to select duplicate records based on a match of three columns. The list of triples could be very long (1000), so I would like to make it concise.

When I have a list of size 10 (known duplicates) it only matches 2 (seemingly random ones) and misses the other 8. I expected 10 records to return, but only saw 2.

I've narrowed it down to this problem:

This returns one record. Expecting 2:

select * 
from ali
where (accountOid, dt, x) in
(
  (64, '2014-03-01', 10000.0), 
  (64, '2014-04-23', -122.91)
)

Returns two records, as expected:

select * 
from ali
where (accountOid, dt, x) in ( (64, '2014-03-01', 10000.0) )
or (accountOid, dt, x) in ( (64, '2014-04-23', -122.91) )

Any ideas why the first query only returns one record?

user3877299
  • 101
  • 1
  • 6
  • 2
    I was unable to reproduce the problem on my machine (MySQL 5.6.14). Can you make a fiddle? – Vatev Jul 25 '14 at 14:30
  • 1
    Can you edit your question and describe the data types of columns (`SHOW CREATE TABLE ali`)? If you're using `FLOAT` or `DOUBLE` for column `x` it can cause equality comparisons to fail because the exact value is rounded in unexpected ways. Also please specify exactly which version of MySQL you use. – Bill Karwin Jul 25 '14 at 17:58
  • Can you post the code you are using. I copied this to a table and at first it looked like it was an operand value, but that was my fault in missing a parenthesis set. Like Vatev I was unable to reproduce the problem – DanceSC Jul 25 '14 at 18:31
  • Don't know how to make it a fiddle. I've changed my query to use: (a,b,c) = (...) or (a,b,c) = () or (a,b,c) = () ... Not ideal I know, but it works. I'll post the code when I have more time. Thanks for your interest. – user3877299 Jul 28 '14 at 15:15
  • google for sql fiddle (sqlfiddle.com), its a plattform for posting and testing SQL. Or simply post your table create statement here. – dube Jul 29 '14 at 12:48

1 Answers1

3

I'd suggest you don't use IN() for this, instead use a where exists query, e.g.:

CREATE TABLE inlist
    (`id` int, `accountOid` int, `dt` datetime, `x` decimal(18,4))
;

INSERT INTO inlist
    (`id`, `accountOid`, `dt`, `x`)
VALUES
    (1, 64, '2014-03-01 00:00:00', 10000.0),
    (2, 64, '2014-04-23 00:00:00', -122.91)
;

select *
from ali
where exists ( select null
               from inlist
               where ali.accountOid = inlist.accountOid
               and ali.dt = inlist.dt
               and ali.x = inlist.x
             )
;

I was able to reproduce a problem (compare http://sqlfiddle.com/#!2/7d2658/6 to http://sqlfiddle.com/#!2/fe851/1 both MySQL 5.5.3) where if the x column was numeric and the value negative it was NOT matched using IN() but was matched when either numeric or decimal using a table and where exists.

Perhaps not a conclusive test but personally I wouldn't have used IN() for this anyway.

Why are you not determining the duplicates this way?

select
        accountOid
      , dt
      , x
from ali
group by
        accountOid
      , dt
      , x
having
        count(*) > 1

Then use that as a derived table within the where exists condition:

select *
from ali
where exists (
               select null
               from (
                      select
                              accountOid
                            , dt
                            , x
                      from ali
                      group by
                              accountOid
                            , dt
                            , x
                      having
                              count(*) > 1
                     ) as inlist
               where ali.accountOid = inlist.accountOid
               and ali.dt = inlist.dt
               and ali.x = inlist.x
             )

see http://sqlfiddle.com/#!2/ede292/1 for the query immediately above

Paul Maxwell
  • 33,002
  • 3
  • 32
  • 51
  • 1
    The [inner join syntax would seem simpler](http://sqlfiddle.com/#!2/ede292/2) to me. I would also expect the join to be as efficient, although, according to SQL Fiddle, the query plans do differ and I am not very good at interpreting those. – Andriy M Jul 30 '14 at 06:21
  • yes, good point, inner join using the same derived table is an option too. sqlfiddle's execution plans often aren't that relevant as there are no indexes and data scale is too small - but there isn't a big difference in the 2 approaches. – Paul Maxwell Jul 30 '14 at 06:51