Select 1000 distinct names from 100 million records via standard sql

Question

I have a table tb_FirstName with one field FirstName. The table has 100 million non null records with lots of repetitions e.g. John occurs 2 million times. The distinct count of FirstName is over 2 million.

How do I select 1000 distinct names as quickly as possible using standard sql?

I'm currently using the following but this is

tSQL

Maybe not as efficient as it could be.

SELECT x.FirstName
FROM (
    SELECT  FirstName,
            rnk = RANK() OVER (ORDER BY Firstname)
    FROM    WHData.dbo.tb_DimUserAccount A
    GROUP BY FirstName
    ) x
WHERE rnk <=1000

You might find this usefull http://stackoverflow.com/questions/595123/is-there-an-ansi-sql-alternative-to-the-mysql-limit-keyword but may I ask what's the intention for having a table tb_FirstName full of duplicates? — bummi, Mar 29 '13 at 11:38
@bummi this question is a slimmed down example of the reality - the table actually has 30 fields but no need to include that in the question — whytheq, Mar 29 '13 at 12:10
100MM names? sounds like a spam/marketing database of personal info... not sure if to help you or not — Remus Rusanu, Mar 29 '13 at 13:42
@RemusRusanu we have lots of users over past 10+yrs. Not spam and not marketing. Just analysis. — whytheq, Mar 29 '13 at 13:49
For analysis you should extract all the distinct names into a separate table, once, and then join at will with it. — Remus Rusanu, Mar 29 '13 at 13:51

score 12 · Accepted Answer · answered Mar 29 '13 at 11:40

12

Seems like you could use TOP 1000 with DISTINCT:

SELECT DISINCT TOP 1000 FirstName
FROM WHData.dbo.tb_DimUserAccount
ORDER BY FirstName

Condensed SQL Fiddle Demo

answered Mar 29 '13 at 11:40

sgeddes

62,311
6
61
83

@whytheq -- np, glad we could help! – sgeddes Mar 29 '13 at 13:12
FYI: does not work on columns that can't be compared, like XML or TEXT. – Maury Markowitz Jan 22 '19 at 14:54

score 3 · Answer 2 · answered Mar 29 '13 at 11:43

3

Try this

SELECT TOP 1000 FirstName FROM 
(SELECT 
ROW_NUMBER() OVER(PARTITION BY FirstName ORDER BY FirstName) NO,
 FirstName FROM WHData.dbo.tb_DimUserAccount )
  AS T1 WHERE no =1

or

SELECT DISINCT TOP 1000 FirstName
FROM WHData.dbo.tb_DimUserAccount ORDER BY FirstName

answered Mar 29 '13 at 11:43

Harshil

403
2
7

The second approach is very appealing. It is neat and accurate. Plus it is very readable. – Oybek Mar 29 '13 at 11:48
2

OP asks for standard SQL, while `ROW_NUMBER() OVER(PARTITION...` looks like tSQL feature. – Artemix Mar 29 '13 at 12:10
Second is ANSI standard me believes. – Jason Carter Mar 29 '13 at 12:53

score 2 · Answer 3 · answered Mar 29 '13 at 11:39

2

Make sure you have an index defined on FirstName.

SELECT TOP 1000 FirstName
FROM (SELECT DISTINCT FirstName
FROM dbo.tb_DimUserAccount) N
ORDER BY FirstName

answered Mar 29 '13 at 11:39

Nick

401
6
19

1

No performance improvement because full table scan is being done in sub query. This is same as post. – Romil Kumar Jain Mar 29 '13 at 11:43
True, full table scan still required but performance slightly better performance than post – Nick Mar 29 '13 at 11:59

score 2 · Answer 4 · answered Mar 29 '13 at 11:41

2

You need the data after sorting the results on FirstName fields.

It requires full table scan if Index is not created. If Index is created on FirstName then Unique Index scan can improve the time.

answered Mar 29 '13 at 11:41

Romil Kumar Jain

20,239
9
63
92

2+ John's is not a rare scenario – Oybek Mar 29 '13 at 11:47
We can't place the clustered Index, so Unique Index will have better performance rather than a normal non clustered index. – Romil Kumar Jain Mar 29 '13 at 11:52
Unique index is ok only if the target column contains only unique elements. According to the name of the column `FirstName` it is highly unlikely this to be true. – Oybek Mar 29 '13 at 11:55

score 1 · Answer 5 · answered Mar 29 '13 at 11:53

1

Option with GROUP BY clause

SELECT TOP 1000 FirstName
FROM WHData.dbo.tb_DimUserAccount
GROUP BY FirstName
ORDER BY FirstName

answered Mar 29 '13 at 11:53

Aleksandr Fedorenko

16,594
6
37
44

Select 1000 distinct names from 100 million records via standard sql

5 Answers5