4

I am using Snowflake database and ran this query to find total count, number of distinct records and difference:

select 
    (select count(*) from mytable) as total_count, 
    (select count(*) from (select distinct * from mytable)) as distinct_count,
    (select count(*) from mytable) - (select count(*) from (select distinct * from mytable)) as duplicate_count
from mytable limit 1;

Result:

1,759,867
1,738,924
20,943 (duplicate_count)

But when try with the other approach (group ALL columns and find where count is > 1):

select count(*) from (
SELECT 
    a, b, c, d, e,
    COUNT(*)
FROM 
    mytable
GROUP BY 
    a, b, c, d, e
HAVING 
    COUNT(*) > 1
)

I get 5,436.

Why there is a difference in number of duplicates? (20,943 vs 5,436)

Thanks.

Joe
  • 11,983
  • 31
  • 109
  • 183

1 Answers1

11

Okay. Let's start from one simple example:

create table #test
(a int, b int, c int, d int, e int)

insert into #test values (1,2,3,4,5)
insert into #test values (1,2,3,4,5)
insert into #test values (1,2,3,4,5)
insert into #test values (1,2,3,4,5)
insert into #test values (1,2,3,4,5)
insert into #test values (5,4,3,2,1)
insert into #test values (5,4,3,2,1)
insert into #test values (1,1,1,1,1)

And try your subquery to understand what you will get:

SELECT 
    a, b, c, d, e,
    COUNT(*)
FROM 
    #test
GROUP BY 
    a, b, c, d, e
HAVING 
    COUNT(*) > 1

Think about a while...

Dang Dang Dang Dang ~

a   b   c   d   e   (No column name)
1   2   3   4   5   5
5   4   3   2   1   2

It will only return two rows because you used 'group by'. But it still counted the duplicate numbers for each a,b,c,d,e combinations.

If you want the total number of the duplicates, try this:

select sum(sub_count) from (
SELECT 
    a, b, c, d, e,
    COUNT(*) - 1 as sub_count
FROM 
    #test
GROUP BY 
    a, b, c, d, e
HAVING 
    COUNT(*) > 1)a

You need to minus one in this case if I understand your original queries correctly. Correct me if I am wrong.

Gen Wan
  • 1,979
  • 2
  • 12
  • 19
  • Great, thanks.. It make sense since for each group at first I assigned only 1 count (but instead it can be more than 1 of the same duplicate).. Now the number match.. – Joe May 21 '19 at 23:21
  • Is there a way to delete them from a table without using a 'temp' table? – Joe May 21 '19 at 23:29
  • @Joe 1, if you are asking how to delete duplicates rows. Come here: https://stackoverflow.com/questions/18390574/how-to-delete-duplicate-rows-in-sql-server/38938704. 2, I used temp table for testing purpose. You can delete without a temp table. – Gen Wan May 21 '19 at 23:34