91

What happens if a clustered index is not unique? Can it lead to bad performance because inserted rows flow to an "overflow" page of some sorts?

Is it "made" unique and if so how? What is the best way to make it unique?

I am asking because I am currently using a clustered index to divide my table in logical parts, but the performance is so-so, and recently I got the advice to make my clustered indexes unique. I'd like a second opinion on that.

Luke Girvin
  • 13,221
  • 9
  • 64
  • 84
littlegreen
  • 7,290
  • 9
  • 45
  • 51

4 Answers4

102

They don't have to be unique but it certainly is encouraged.
I haven't encountered a scenario yet where I wanted to create a CI on a non-unique column.

What happens if you create a CI on a non-unique column

If the clustered index is not a unique index, SQL Server makes any duplicate keys unique by adding an internally generated value called a uniqueifier

Does this lead to bad performance?

Adding a uniqueifier certainly adds some overhead in calculating and in storing it.
If this overhead will be noticable depends on several factors.

  • How much data the table contains.
  • What is the rate of inserts.
  • How often is the CI used in a select (when no covering indexes exist, pretty much always).

Edit
as been pointed out by Remus in comments, there do exist use cases where creating a non-unique CI would be a reasonable choice. Me not having encountered one off those scenarios merely shows my own lack of exposure or competence (pick your choice).

Mitch Wheat
  • 295,962
  • 43
  • 465
  • 541
Lieven Keersmaekers
  • 57,207
  • 13
  • 112
  • 146
  • 39
    +1 cause everything you say is correct, but just wanted to add: non-unique CI are quite common when range scans on particular (non-unique) column is the prevalent access pattern. – Remus Rusanu Dec 02 '10 at 08:57
  • 1
    @Remus Rusanu: I *was* thinking about adding a disclaimer to my scenario statement like *but that doesn't mean anything*. Thanks for pointing out a scenario where it could be usefull. – Lieven Keersmaekers Dec 02 '10 at 09:44
  • 4
    @Remus: so you mean the niche situation where you have a non-unique column like 'Departmentid' where you query something like 'DepartmentId BETWEEN 1 and 100'? *edit* ah i see what you mean, yes a date column in a logging table is a good example too. – littlegreen Dec 02 '10 at 11:30
  • Hey, I've got an event stream table where multiple rows exist with same "AggregateId" which is a column fo type GUID. The only queries performed on table are to get all events for a given AggregateId. I'm wondering should this be a clustered index or a non clustered index? – Shayan C Nov 28 '18 at 00:02
  • @ShayanC - If performance of retrieval is your primary goal, I'd make it a CI to likely save on IO when retrieving all rows for a given ID. As with all performance scenario's though, the only sure way is to measure. – Lieven Keersmaekers Nov 28 '18 at 11:14
  • I use a non-unique clustering key for fetching user-boundary data for millions of users. Queries often fetch a handful (1-100) of records for a single user, so the data is clustered on {business id, user id}. That keeps each user's data clustered together on a single data page, making the fetches very i/o efficient when fetching all records for a single user. Incidentally, this also eliminates the need for any kind of secondary indexes for any kind of user data query, because all the data for a user is on a single data page, such that any scan takes place entirely in-memory. – Triynko Jul 15 '20 at 15:57
32

I like to check out what The Queen of Indexing, Kimberly Tripp, has to say on the topic:

I'm going to start with my recommendation for the Clustering Key - for a couple of reasons. First, it's an easy decision to make and second, making this decision early helps to proactively prevent some types of fragmentation. If you can prevent certain types of base-table fragmentation then you can minimize some maintenance activities (some of which, in SQL Server 2000 AND less of which, in SQL Server 2005) require that your table be offline. OK, I'll get to the rebuild stuff later.....

Let's start with the key things that I look for in a clustering key:

* Unique
* Narrow
* Static

Why Unique? A clustering key should be unique because a clustering key (when one exists) is used as the lookup key from all non-clustered indexes. Take for example an index in the back of a book - if you need to find the data that an index entry points to - that entry (the index entry) must be unique otherwise, which index entry would be the one you're looking for? So, when you create the clustered index - it must be unique. But, SQL Server doesn't require that your clustering key is created on a unique column. You can create it on any column(s) you'd like. Internally, if the clustering key is not unique then SQL Server will “uniquify” it by adding a 4-byte integer to the data. So if the clustered index is created on something which is not unique then not only is there additional overhead at index creation, there's wasted disk space, additional costs on INSERTs and UPDATEs, and in SQL Server 2000, there's an added cost on a clustereD index rebuild (which because of the poor choice for the clustering key is now more likely).

Source: Ever-increasing clustering key debate - again!

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
  • A question though, the Queen recommends a newsequentialid to uniquify the data, but SQL Server generates its own uniquifier if you don't specify it. Is there then still any reason to add your own sequential id? – littlegreen Dec 02 '10 at 11:41
  • 3
    @littlegreen: she says if you insist on using GUID's (which are really really bad for use in an clustering index), then at least use `newsequentialid()` to get a almost sequentialized GUID. But yes: if **you** add your own unique ID (I always prefer INT IDENTITY), then you have that value at hand, and you can use it (e.g. to establish a FK relationship). The SQL Server added uniquefiers are invisible to you and thus they're only overhead you can't make use of. – marc_s Dec 02 '10 at 13:02
  • I see. Well that would be an argument in favour of a (CompanyID, DepartmentID, id INT IDENTITY) clustered index instead of just the first two. Thanks! – littlegreen Dec 02 '10 at 14:21
  • 1
    @littlegreen: better yet - make your clustered index **only** on (ID INT IDENTITY) and put the other fields - if needed - into a separate, non-clustered index. The clustered index should be as small as possible - after all, the clustered index columns are being added to each and every entry of each and every non-clustered index on that table, too - so don't waste your bytes with a wide clustered index! – marc_s Dec 02 '10 at 14:43
  • 1
    Yes but then I lose the benefit of all my department data being grouped, and me being able to insert/delete/retrieve a whole department at once. My data will become scattered and inserts/deletes on whole departments or even whole companies will be slow. My queries only ever run on a single company at once and frequently a whole dataset needs to be updated. – littlegreen Dec 02 '10 at 14:47
  • @littlegreen: true - I guess you'll just have to do some performance testing and see which option works out the best for you. – marc_s Dec 02 '10 at 14:52
  • I'll do that. I was also considering table partitioning.. but now I have some new input to think about. Thanks! – littlegreen Dec 02 '10 at 15:03
9

Do clustered indexes have to be unique?

They don't, and there are times where it's better if they're not.

Consider a table with a semi-random, unique EmployeeId, and a DepartmentId for each employee: if your select statement is

SELECT * FROM EmployeeTable WHERE DepartmentId=%DepartmentValue%

then it's best for performance if the DepartmentId is the clustered index even though (or even especially because) it's not the unique index (best for performance because it ensures that all the records within a given DepartmentId are clustered).


Do you have any references?

There's Clustered Index Design Guidelines for example, which says,

With few exceptions, every table should have a clustered index defined on the column, or columns, that offer the following:

  • Can be used for frequently used queries.
  • Provide a high degree of uniqueness.
  • Can be used in range queries.

My understanding of "high degree of uniqueness" for example is that it isn't good to choose "Country" as the clusted index if most of your queries want to select the records within a given town.

Amit Joshi
  • 15,448
  • 21
  • 77
  • 141
ChrisW
  • 54,973
  • 13
  • 116
  • 224
  • Yeah that's what I thought until now, but I also get the exact opposite advice, so I wonder which is true. Do you have any references? – littlegreen Dec 02 '10 at 11:27
  • @littlegreen I edited my answer to try to answer your question. – ChrisW Dec 02 '10 at 13:08
  • Thanks. Yeah okay, I see your point. But if you are regularly inserting a whole country at once, a clustered index on (country, town) would seem cumbersome to me since it requires sorting the data. On the other hand, a sort before insert wouldn't be that much of trouble... – littlegreen Dec 02 '10 at 14:19
  • 3
    Surely in your example, a unique clustered index on {DepartmentID, EmployeeID} would be preferable? Why have the system create a uniqueifier when your existing field would provide uniqueness with less overhead (probably a four-byte INT) and may let you run a few more queries within the index alone? –  Dec 05 '12 at 22:10
0

If you are tuning an old DB this is a Godsend. I am working on Perf issues on a 20-year-old DB. It has nonclustered PKs with 3 - 8 columns. Instead of using all 8 columns to be unique I can pick one column with broad distribution, and it applies a Uniqueifier. It is an Int but by using a column like Project ID it can handle 2147483647 unique projectIDs which is enough for most use-cases. If it is not enough add a second or third column to the cluster. This works without any coding modification in the App layer. 20 years in production and management doesn't have to order a major rewrite.

  • 1
    As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Dec 15 '22 at 15:19