If clustered index is table data, how can it not be unique?

Question

I was seaching how to move a table from one filegroup to the other, and I had some doubts as to why most of the replies I found dealt with clustered indexes, considering that my question had to do with tables.

Then I looked at How I can move table to another filegroup?, and it says that the clustered index is the table data, which explains the reasoning behind recreating a clustered index with CREATE CLUSTERED INDEX.

But in that same question it says that if my clustered index is unique, then do something else.

My question: I assume that when I create tables on a database, a clustered index is created for that table. So how can it not be unique?

Thanks.

usr · Answer 1 · 2016-08-28T09:03:02.117

3

If you have an int array and you store the number 1 twice in it - how can that array not be unique?! (Trick question to get you thinking. It clearly can be not unique.) Being unique is a constraint on the data. Fundamentally, there is nothing preventing you from creating multiple rows that have the same values in all columns.

In a heap this is not a problem physically at all. The internal row identifier is it's location on disk.

In a b-tree based index (a "clustered index") the physical data structure indeed requires uniqueness. Note, that the logical structure (the table) does not. This is a physical concern. It's an implementation detail. SQL Server does this by internally appending a key column that contains a sequence number that is counted upwards. This disambiguates the records. You can observe this effect by creating more than 2^32 rows with the same non-unique key. You will receive an error.

So there's a hidden column in the table that you cannot access. It's officially called "uniqueifier". Internally, it's used to complete the CI key to make it unique. It's stored and used everywhere where normally the unique CI key would be used: In the CI, in non-unique NCIs, in the lock hash and in query plans.

edited Aug 28 '16 at 09:03

answered Aug 24 '16 at 22:37

usr

168,620
35
240
369

1

To elaborate a little bit on the uniqueifier, you don't need all 2^32 rows with the duplicate key in the table at the same time. We got bitten by this as we had an application insert a nonce as an ID (clustered key) and then update it later. After we overflowed the four byte uniqueifier, we got the SQL 666 error. – Ben Thul Aug 25 '16 at 00:03
@BenThul ah, so it's apparently incremented based on the max existing uniquifier. Nasty. – usr Aug 25 '16 at 12:07
That's understating it. AFAIK, there's no way to expose the "current" value (short of some `DBCC PAGE` voodoo), and you can't rely on doing a count of the rows that currently have that value. So monitoring for this is a nightmare. As in, you really have to be paying attention to things like insertion rate and what that implies with regards to exhaustion of the key space. – Ben Thul Aug 25 '16 at 12:37
But it only applies to non-unique CI's, right? For some reason I never have such a thing. What was your use case? Did you need/want to cluster on some non-key? – usr Aug 25 '16 at 12:48
Correct. In the case of a unique CI, the uniqueifier isn't necessary, so there's nothing to exhaust. Without divulging too much, the PK and the clustering key were different on this table. Think of a membership table where the PK is MemberID and the CI is GroupID. The idea being that the main access path is by GroupID, so when I ask for GroupID=12345, all of those rows are near each other physically. And the anti-pattern described above was something like "when we add a Member that doesn't have a group yet, put them in this table, but with a GroupID=-1". – Ben Thul Aug 25 '16 at 12:53
I see. That's an insidious time bomb. – usr Aug 25 '16 at 12:55
Yep. I had a play with this and `DBCC WRITEPAGE` here http://dba.stackexchange.com/a/49458/3690 – Martin Smith Aug 26 '16 at 07:16
It may be helpful to know that while some official documentation refers to it as 'uniqueifier', it's also shown internally (e.g., the column name returned by running DBCC PAGE) as 'uniquifier'. – ozmo Jul 29 '19 at 05:04

score 1 · Answer 2 · answered Aug 25 '16 at 00:10

If Clustered Index is not unique then SQL Server internally creates Uniquifier to make uniqueness on that record. I will try to explain with an example:

CREATE TABLE Test2 (Col1 INT, Col2 INT)

CREATE CLUSTERED INDEX idxClustered ON Test2 (Col1) 
CREATE NONCLUSTERED INDEX idxNonClustered ON test2 (Col2)

Here cluserered index is not unique

INSERT INTO Test2 VALUES (1,1), (2,2) 
INSERT INTO Test2 VALUES (3,3)
INSERT INTO Test2 VALUES (3,3)

--Get the Page Number of the Non Clustered Index
DBCC IND (Test, Test2, -1)

--Examine the Results of the Page
--Not to run in production
DBCC TRACEON (3604); 
DBCC PAGE(Test, 1, 3376, 3);

You will see Uniquifier key with corresponding uniqueness value... If your clustered index is Unique Clustered Index then It will not have that Uniquifier attribute.

clifton_h · Answer 3 · 2016-08-26T05:31:07.007

**usr* has a good post worth reading. I will add here from Microsofts Documentation.

First, you are not alone with Clustered-Indexes. Honestly, the name itself is somewhat confusing (Structured-Indexes or Disk-Indexes would probably be better in SQL).

Refer back to the official documentation from MSDN. Any alterations by me are in italics:

A Clustered Index is an on-disk structure of the table. This means the values are pointing to a physical location. This is why when you move the table you need to recreate the Index because the physical location has been altered.

Clustered

Clustered indexes sort and store the data rows in the table or view based on their key values. These are the columns included in the index definition. There can be only one clustered index per table, because the data rows themselves can be sorted in only one order.

The only time the data rows in a table are stored in sorted order is when the table contains a clustered index. When a table has a clustered index, the table is called a clustered table. If a table has no clustered index, its data rows are stored in an unordered structure called a heap.

Nonclustered

Nonclustered indexes have a structure separate from the data rows (like pointers, this is a logical ordering of the data that consumes a fraction of the physical disk space).

A nonclustered index contains the nonclustered index key values and each key value entry has a pointer to the data row that contains the key value.

The pointer from an index row in a nonclustered index to a data row is called a row locator. The structure of the row locator depends on whether the data pages are stored in a heap or a clustered table (think ordered).

For a heap, a row locator is a pointer to the row.

For a clustered table, the row locator is the clustered index key.

ABSTRACT VIEW:

A table created is not necessarily a clustered (ordered) table.
An index does not necessarily have to be unique. It is an abstract view of the table.
- Unique means that a value or set of values will not repeat themselves. If you wish to enforce this, you can add a constraint by the index (i.e. UNIQUE CLUSTERED INDEX) or a CONSTRAINT such as PRIMARY KEY if you wish this to be managed in the table structure itself.
You may have multiple unique indexes since as long as the values are represented logically, they will not share the same value as another row pointer.

Consider you have Columns A, B, and C in a given table.

Column A was created with a UNIQUE CLUSTERED INDEX. This means that either A already had an enforceable UNIQUE constraint (like PK, UNIQUE CONSTRAINT) or was DECLARED EXPLICITLY.

A Column Group {B,C} could be a unique index so long as B and C never repeat itself together. In the same way, you could theoretically have indexes with the groups {A}, {B,C}, {A,C}, and every one of them be unique. Recall that an index is a logical ordering of the data so they likely will not have the same logical value (and thus are unique).

HOWEVER: unless the datatype, constraint (including the INDEX constraint), or table structure enforces a unique constraint on a COLUMN, you should not assume the index is unique. Furthermore, you cannot create a UNIQUE index if there are more than one rows containing the same combination of NULL values since SQL Server will treat them as the same value (NULL being unknown).

Will SQL Server use your indexes, unique or not? Well that is another story and depends on a number of things. But hopefully you find this post helpful.

Sources: MSDN - Clustered and Nonclustered Indexes Described

I would have preferred "primary storage" instead of "clustered index" and b-tree as term for both CI and NCI. SQL Server now recognizes many storage formats: heaps, b-trees, columnstore, hekaton. I think it should have been `create primary/secondary btree/heap/columnstore/hekaton index ...`. And each table should start out with a primary heap. That would be a nice model. We won't ever get this :) — usr, Aug 26 '16 at 09:52

score 0 · Answer 4 · edited May 23 '17 at 12:10

0

A clustered index doesn't have to be unique. But, there can be only one clustered index on a table, because a clustered index actually determines the physical order of the table rows on disk (but I find it confusing to say that the clustered index is the table data, per se, even though they are strongly tied to each other).

HERE is a good post all about non-unique clustered indexes. Even if the index was the entire row of data, you can certainly have duplicate rows (no PK), which would equate to duplicate clustered index nodes.

edited May 23 '17 at 12:10

Community

1
1

answered Aug 24 '16 at 21:58

SlimsGhost

2,849
1
10
16

1

There's really no physical difference between CI and NCI. It's an artificial distinction and the "physical order of rows" is not a well defined term. A better mental model is to consider all indexes as (partial) but equal copies of the data. – usr Aug 24 '16 at 22:32
@usr wow, what? According to whom? Physical just means the way they are organized...we are not talking machine level organization but rather the order which SOL Server reads them. A non-clustered index is a logical ordering. Why do you think there can only be one Clustered Index on a table? Because Microsoft was being funny? – clifton_h Aug 24 '16 at 23:41
It's a design decision that simplifies a lot of things. Designating one of the copies of the data as "primary" makes the model easier. If you have a CI and an NCI that contains all columns you have the same data stored twice. If the key columns are the same in both there are not even physical differences (that matter). The query optimizer also does not care about whether a b-tree is CI or NCI. The fundamental concept is that the table data can be stored in multiple formats that are good for different accesses. There is no inherent rule that forces one of the copies to be special. – usr Aug 25 '16 at 12:06
@clifton_h There can be only one CI because there is only one primary copy of the data. The rows at the leaf level of the clustered index *are* the table rows so it is meaningless to attach any significance to the CI rows being in the same order as the table rows as they are both exactly the same thing. The rows at the leaf level of the clustered index aren't necessarily stored in any particular order, such as ordered by the clustered index key. Unfortunately some of the wording in books online gives the wrong impression about this. More details at http://stackoverflow.com/a/24470091/73226 – Martin Smith Aug 25 '16 at 21:29
@Martin.Smith my habit is to recheck my sources and i am glad I did. MSDN confirms that CI is an `On-disk` structure of the data. There can only be one because it is a physical order. NCI has a `row-pointer` which is a logical expression of the position of the data. [Structured and Nonstructured Indexes - MSDN](https://msdn.microsoft.com/en-us/library/ms190457.aspx) – clifton_h Aug 26 '16 at 03:43
@clifton As I already told you books online is misleading about this. But feel free to carry on believing the wrong thing. – Martin Smith Aug 26 '16 at 04:52
@MartinSmith wow, you are a bad word, as my sister would say. My "books online" happened to be MICROSOFT's own documentation. Furthermore, I happen to have a background in OOP languages, so I am quite aware of the difference between physical and logical orderings and understand enough to call you out on your arrogant omniscience. – clifton_h Aug 26 '16 at 05:15
@clifton Books online (BOL) is the name for SQL Servers product documentation from MICROSOFT. It is by no means 100% accurate and certainly contains errors. – Martin Smith Aug 26 '16 at 05:23
@MartinSmith says the guy who did not quote BOL or whatever term you push out to make yourself look smart. My problem is you insist on being high and mighty rather than improve the conversation. – clifton_h Aug 26 '16 at 05:27

If clustered index is table data, how can it not be unique?

4 Answers4