The question appears to confuse two different concepts. The first is a primary key, the second is a clustered index. The first is a logical concept, the latter is a physical concept and refers to how the data is actually stored. There are cases when it is useful to decouple the primary key and clustering key, but for the most part they are one and the same, and by default your primary key will be your clustering key. It is an important distinction nonetheless.
I think people can (and have) argued until the cows come home about whether to use a natural or surrogate primary key. I won't touch upon this too much, but the basic is what you are suggesting when using the 3 columns that would define a unique row is a natural key (i.e. already exists in your data), and another approach is to use an identity column, which will give each row a unique value, this is a surrogate key since it has no actual meaning other than to uniquely identify your row.
So I know can create a primary key over multiple columns so in my case it would be the 3 above. How though does having a
primary key on my table help to improve the performance of select queries?
It doesn't, having an index might help depending on your queries. Given the right index the database engine is able to navigate directly to the required data.
Someone mentioned that I should just create an integer column with incrementing numbers and make that the primary key - I can't see how that can help when running a select query as the new field won't have any meaning & wouldn't be used in any select query or where clause of the query?
This is a good candidate for the clustering key. According to the queen of indexing Kimberly Tripp a clustered index should be:
- Unique
- Narrow
- Static
- Ever increasing pattern
You have already ticked the unique box, with your 3 columns, this is not that narrow, but not wide by any means. The second I can't answer, if UploadDate
is a default value that is entered at the time of creation then you could have an ever increasing pattern, and I have no idea if your three columns are static or they could change. If either of these last two are true then you should be using a surrogate identity column to cluster on regardless.
I would personally probably have eliminated this as a candidate for a clustering key based on the with (26 bytes). You have an extra 4 bytes per row in the clustered index, but you save 22 bytes per row in all subsequent indexes.
So in a table of 10,000,000 rows you gain an extra 38.1 MB due to the identity column, however you gain 209.8MB for each non clustered index, although disk space is cheap, it is not a reason to waste it unnecessarily. It is not just all indexes that gain these 22 bytes, it is also all referencing tables with foreign keys, which leads to my next point, convenience when writing queries. Do you really want to have to type out this join each time you refer to the key:
SELECT *
FROM Parent AS p
INNER JOIN Child AS c
ON c.UploadDate = p.UploadDate
AND c.SecID = p.SecID
AND c.FundCode = p.FundCode;
Or would you rather simply write:
SELECT *
FROM Parent AS p
INNER JOIN Child AS c
ON c.ParentID = p.ParentID;
For this reason, even if I have decided that what is logically a primary key is not a good candidate for a clustering key, I tend to still make the clustering key the primary key for ease of reference in relation tables. For example I have an external API that sends me order details in XML:
<orders>
<order ID="12B47EF2-B9F5-4CD7-811F-2E7EC1A67E59">
<orderdetail>
<product>Some Product</product>
<quantity>1</quantity</quantity>
</orderdetail>
<orderdetail>
<product>Some Other Product</product>
<quantity>2</quantity</quantity>
</orderdetail>
</order>
<order ID="3A819217-49CA-4B4C-8AD5-CAD297FCA3F3">
<etc />
</order>
</orders>
If I was setting up my tables to store this, although the ID from XML would be the logical primary key for my Orders
table, it would be a terrible clustering key, so I would add a surrogate identity field to avoid the fragmentation associated with clustering on a GUID:
CREATE TABLE dbo.Orders
(
OrderID INT IDENTITY NOT NULL,
SupplierOrderID UNIQUEIDENTIFIER NOT NULL,
CONSTRAINT PK_Orders__SupplierOrderID PRIMARY KEY NONCLUSTERED (SupplierOrderID)
);
CREATE UNIQUE CLUSTERED INDEX UQ_Orders__OrderID ON dbo.Orders (OrderID);
The GUID is still the primary key, so my order detail table can refer to this, but I generally think that if I don't consider the key good enough to cluster on, why would I then put the same key into another table as a foreign key. I have already defined a more narrow key in OrderID
, why not just use this as my foreign key in order details, and save myself 12 bytes. So I would end up with:
CREATE TABLE dbo.Orders
(
OrderID INT IDENTITY NOT NULL,
SupplierOrderID UNIQUEIDENTIFIER NOT NULL,
CONSTRAINT PK_Orders__OrderID PRIMARY KEY CLUSTERED (OrderID)
);
CREATE UNIQUE NONCLUSTERED INDEX UQ_Orders__SupplierOrderID ON dbo.Orders (SupplierOrderID);
As with everything, there are exceptions, and there are cases where I would choose the 3 columns as a composite (clustered) primary key, and this woule be if I knew there would be no child tables, and that all my select queries would still require me to select UploadedDate
, SecID
, and FundCode
. If you had an index on Name
for example:
CREATE NONCLUSTERED INDEX IX_YourTable__Name ON dbo.YourTable (Name);
SELECT UploadDate, SecID, FundCode, Name
FROM dbo.YourTable
WHERE Name = 'Bob';
If you have a surrogate key, then you will seek through the name index and find Bob at row 2 only, then lookup row 2 on your clustered index to get the corresponding values for UploadedDate
, SecID
, and FundCode
. If these three columns are your clustering key, then you remove the need for the lookup since you already have the data in the name
index. The extra 209.8MB on each index could be worth it to avoid these lookup operations.
In summary (as usual), it depends - it depends on both your personal preference (I believe Aaron Bertrand and Joe Celko are still at loggerheads on the natural vs surrogate key debate, and if these two great minds can't agree, then the answer really has to be personal preference), and also your exact situation, in some situations you will want a composite primary key, in some instances you will want a surrogate key, in some instances you will want your primary key and your clustering key to be the same thing, in other instances you won't.