7

I have a table tb_FirstName with one field FirstName. The table has 100 million non null records with lots of repetitions e.g. John occurs 2 million times. The distinct count of FirstName is over 2 million.

How do I select 1000 distinct names as quickly as possible using standard sql?

I'm currently using the following but this is

  • tSQL
  • Maybe not as efficient as it could be.

    SELECT x.FirstName
    FROM (
        SELECT  FirstName,
                rnk = RANK() OVER (ORDER BY Firstname)
        FROM    WHData.dbo.tb_DimUserAccount A
        GROUP BY FirstName
        ) x
    WHERE rnk <=1000
    
Deduplicator
  • 44,692
  • 7
  • 66
  • 118
whytheq
  • 34,466
  • 65
  • 172
  • 267
  • You might find this usefull http://stackoverflow.com/questions/595123/is-there-an-ansi-sql-alternative-to-the-mysql-limit-keyword but may I ask what's the intention for having a table tb_FirstName full of duplicates? – bummi Mar 29 '13 at 11:38
  • @bummi this question is a slimmed down example of the reality - the table actually has 30 fields but no need to include that in the question – whytheq Mar 29 '13 at 12:10
  • 100MM names? sounds like a spam/marketing database of personal info... not sure if to help you or not – Remus Rusanu Mar 29 '13 at 13:42
  • @RemusRusanu we have lots of users over past 10+yrs. Not spam and not marketing. Just analysis. – whytheq Mar 29 '13 at 13:49
  • 1
    For analysis you should extract all the distinct names into a separate table, once, and then join at will with it. – Remus Rusanu Mar 29 '13 at 13:51

5 Answers5

12

Seems like you could use TOP 1000 with DISTINCT:

SELECT DISINCT TOP 1000 FirstName
FROM WHData.dbo.tb_DimUserAccount
ORDER BY FirstName

Condensed SQL Fiddle Demo

sgeddes
  • 62,311
  • 6
  • 61
  • 83
3

Try this

SELECT TOP 1000 FirstName FROM 
(SELECT 
ROW_NUMBER() OVER(PARTITION BY FirstName ORDER BY FirstName) NO,
 FirstName FROM WHData.dbo.tb_DimUserAccount )
  AS T1 WHERE no =1 

or

SELECT DISINCT TOP 1000 FirstName
FROM WHData.dbo.tb_DimUserAccount ORDER BY FirstName
Harshil
  • 403
  • 2
  • 7
2

Make sure you have an index defined on FirstName.

SELECT TOP 1000 FirstName
FROM (SELECT DISTINCT FirstName
FROM dbo.tb_DimUserAccount) N
ORDER BY FirstName
Nick
  • 401
  • 6
  • 19
2

You need the data after sorting the results on FirstName fields.

It requires full table scan if Index is not created. If Index is created on FirstName then Unique Index scan can improve the time.

Romil Kumar Jain
  • 20,239
  • 9
  • 63
  • 92
  • 2+ John's is not a rare scenario – Oybek Mar 29 '13 at 11:47
  • We can't place the clustered Index, so Unique Index will have better performance rather than a normal non clustered index. – Romil Kumar Jain Mar 29 '13 at 11:52
  • Unique index is ok only if the target column contains only unique elements. According to the name of the column `FirstName` it is highly unlikely this to be true. – Oybek Mar 29 '13 at 11:55
1

Option with GROUP BY clause

SELECT TOP 1000 FirstName
FROM WHData.dbo.tb_DimUserAccount
GROUP BY FirstName
ORDER BY FirstName
Aleksandr Fedorenko
  • 16,594
  • 6
  • 37
  • 44