understanding how a primary key works & how to use it

Question

I am using SQL Server and creating a table (example is at the very bottom of this question). However I am having some issues understanding how primary keys actually work & how to use them properly.

So I know a primary key ensures all the rows in a table are unique and that a primary key can't be null. I also read this page index basics - simple talk on indices and how indices are organised in a b-tree structure.

So in my table for a row to have a unique value I would have to use the first 3 columns (UploadDate, SecID & FundCode of types datetime, varchar(12) & varchar(6)). Only select queries will be used on this table & the where clause will be using one or more of the the three fields just mentioned.

So I know can create a primary key over multiple columns so in my case it would be the 3 above. How though does having a primary key on my table help to improve the performance of select queries? So I take it the primary key creates an index or some sort with the value of of your column (or in my case 3 columns), I don't see how this will help as my value will be a datetime & two bits of text?

Someone mentioned that I should just create an integer column with incrementing numbers and make that the primary key - I can't see how that can help when running a select query as the new field won't have any meaning & wouldn't be used in any select query or where clause of the query?

type             column name
-------------    ------------
datetime         UploadDate
varchar(12)      SecID
varchar(6)       FundCode
varchar(100)     Name
float            Price
float            Nominal
int              SourceCode
datetime         PriceDate

example of a some rows

UploadDate   SecID    FundCode   Name   Price   Nominal   SourceCode   PriceDate
2015-08-20   A045     ABCVPL     Joe    1.3434  1000.33   3
2015-08-20   A563     ABCVPL     Bob    1.5961  10.33     3
2015-08-20   A045     DEFGHJ     Joe    1.3434  856.41    3
2015-08-20   XC45     PLMNOI     Pip    2.3654  25.52     3
2015-08-20   KMM5     ABCVPL     Nit    6.9565  1532      3
2015-08-21   A045     ABCVPL     Joe    4.3434  1112      3
2015-08-21   GH45     DEFGHJ     Joe    3.3434  16532     3
2015-08-21   PL34     DEFGHJ     Joe    7.3434  635       3
2015-08-21   ER33     ABCVPL     Joe    8.3434  6320      3

"Someone mentioned that I should just create an integer column with random numbers" - I would say that is false. Increment numbers yes, random numbers no. — Kritner, Aug 24 '15 at 12:33
Here's some information on the different "types" of primary keys and the reasoning behind using one over another. http://www.agiledata.org/essays/keys.html — Kritner, Aug 24 '15 at 12:34
@Kritner sorry they actually did say increment numbers not random numbers - will correct that. I still don't see how that helps though as these numbers don't relate to the data in any meaningful way? — mHelpMe, Aug 24 '15 at 12:35
I think this might have already been answered here: [SO:How To Speed Up SQL Queries][1] [1]: http://stackoverflow.com/questions/17354219/how-to-speed-up-sql-queries-indexes — Bryant Frankford, Aug 24 '15 at 12:48
http://stackoverflow.com/questions/31949355/what-clustered-index-scan-clustered-means-on-sql-server-execution-plan/31961357#31961357 THIS ANS MIGHT HELP — Neeraj Prasad Sharma, Aug 24 '15 at 13:01
http://stackoverflow.com/questions/32137416/designing-new-table-for-daily-uploads-use-unique-constraint/32138848#32138848 AND THIS AS WELL — Neeraj Prasad Sharma, Aug 24 '15 at 13:07

score 3 · Accepted Answer · answered Aug 24 '15 at 13:41

The question appears to confuse two different concepts. The first is a primary key, the second is a clustered index. The first is a logical concept, the latter is a physical concept and refers to how the data is actually stored. There are cases when it is useful to decouple the primary key and clustering key, but for the most part they are one and the same, and by default your primary key will be your clustering key. It is an important distinction nonetheless.

I think people can (and have) argued until the cows come home about whether to use a natural or surrogate primary key. I won't touch upon this too much, but the basic is what you are suggesting when using the 3 columns that would define a unique row is a natural key (i.e. already exists in your data), and another approach is to use an identity column, which will give each row a unique value, this is a surrogate key since it has no actual meaning other than to uniquely identify your row.

So I know can create a primary key over multiple columns so in my case it would be the 3 above. How though does having a primary key on my table help to improve the performance of select queries?

It doesn't, having an index might help depending on your queries. Given the right index the database engine is able to navigate directly to the required data.

Someone mentioned that I should just create an integer column with incrementing numbers and make that the primary key - I can't see how that can help when running a select query as the new field won't have any meaning & wouldn't be used in any select query or where clause of the query?

This is a good candidate for the clustering key. According to the queen of indexing Kimberly Tripp a clustered index should be:

Unique
Narrow
Static
Ever increasing pattern

You have already ticked the unique box, with your 3 columns, this is not that narrow, but not wide by any means. The second I can't answer, if UploadDate is a default value that is entered at the time of creation then you could have an ever increasing pattern, and I have no idea if your three columns are static or they could change. If either of these last two are true then you should be using a surrogate identity column to cluster on regardless.

I would personally probably have eliminated this as a candidate for a clustering key based on the with (26 bytes). You have an extra 4 bytes per row in the clustered index, but you save 22 bytes per row in all subsequent indexes.

So in a table of 10,000,000 rows you gain an extra 38.1 MB due to the identity column, however you gain 209.8MB for each non clustered index, although disk space is cheap, it is not a reason to waste it unnecessarily. It is not just all indexes that gain these 22 bytes, it is also all referencing tables with foreign keys, which leads to my next point, convenience when writing queries. Do you really want to have to type out this join each time you refer to the key:

SELECT  *
FROM    Parent AS p
        INNER JOIN Child AS c
            ON c.UploadDate = p.UploadDate
            AND c.SecID = p.SecID
            AND c.FundCode = p.FundCode;

Or would you rather simply write:

SELECT  *
FROM    Parent AS p
        INNER JOIN Child AS c
            ON c.ParentID = p.ParentID;

For this reason, even if I have decided that what is logically a primary key is not a good candidate for a clustering key, I tend to still make the clustering key the primary key for ease of reference in relation tables. For example I have an external API that sends me order details in XML:

<orders>
    <order ID="12B47EF2-B9F5-4CD7-811F-2E7EC1A67E59">
        <orderdetail>
            <product>Some Product</product>
            <quantity>1</quantity</quantity>
        </orderdetail>
        <orderdetail>
            <product>Some Other Product</product>
            <quantity>2</quantity</quantity>
        </orderdetail>
    </order>
    <order ID="3A819217-49CA-4B4C-8AD5-CAD297FCA3F3">
        <etc />
    </order>
</orders>

If I was setting up my tables to store this, although the ID from XML would be the logical primary key for my Orders table, it would be a terrible clustering key, so I would add a surrogate identity field to avoid the fragmentation associated with clustering on a GUID:

CREATE TABLE dbo.Orders
(
        OrderID INT IDENTITY NOT NULL,
        SupplierOrderID UNIQUEIDENTIFIER NOT NULL,

    CONSTRAINT PK_Orders__SupplierOrderID PRIMARY KEY NONCLUSTERED (SupplierOrderID)
);
CREATE UNIQUE CLUSTERED INDEX UQ_Orders__OrderID ON dbo.Orders (OrderID);

The GUID is still the primary key, so my order detail table can refer to this, but I generally think that if I don't consider the key good enough to cluster on, why would I then put the same key into another table as a foreign key. I have already defined a more narrow key in OrderID, why not just use this as my foreign key in order details, and save myself 12 bytes. So I would end up with:

CREATE TABLE dbo.Orders
(
        OrderID INT IDENTITY NOT NULL,
        SupplierOrderID UNIQUEIDENTIFIER NOT NULL,

    CONSTRAINT PK_Orders__OrderID PRIMARY KEY CLUSTERED (OrderID)
);
CREATE UNIQUE NONCLUSTERED INDEX UQ_Orders__SupplierOrderID ON dbo.Orders (SupplierOrderID);

As with everything, there are exceptions, and there are cases where I would choose the 3 columns as a composite (clustered) primary key, and this woule be if I knew there would be no child tables, and that all my select queries would still require me to select UploadedDate, SecID, and FundCode. If you had an index on Name for example:

CREATE NONCLUSTERED INDEX IX_YourTable__Name ON dbo.YourTable (Name);

SELECT  UploadDate, SecID, FundCode, Name
FROM    dbo.YourTable
WHERE   Name = 'Bob';

If you have a surrogate key, then you will seek through the name index and find Bob at row 2 only, then lookup row 2 on your clustered index to get the corresponding values for UploadedDate, SecID, and FundCode. If these three columns are your clustering key, then you remove the need for the lookup since you already have the data in the name index. The extra 209.8MB on each index could be worth it to avoid these lookup operations.

In summary (as usual), it depends - it depends on both your personal preference (I believe Aaron Bertrand and Joe Celko are still at loggerheads on the natural vs surrogate key debate, and if these two great minds can't agree, then the answer really has to be personal preference), and also your exact situation, in some situations you will want a composite primary key, in some instances you will want a surrogate key, in some instances you will want your primary key and your clustering key to be the same thing, in other instances you won't.

Thanks for a very detailed answer. Very helpful will probably have to read a few times to grasp everything. Yes the date column is the date of the record is uploaded so is always increasing. By static I take it you mean the values of the three columns will not change? That is the case actually for all the fields once in the table they values will not change. There is no child table either. The queries will be selecting the secID for a certain FundCode for a certain UploadDate or over a date range. — mHelpMe, Aug 24 '15 at 14:13
Yes, by static I mean never changing, the primary key, once created should not change. By the sounds of it a surrogate key might be overkill, and you should probably just use the three columns you have. Since the first column will be `UploadDate` you are unlikely to find the unique index created any use for queries that are not filtering on `UploadDate`, as such you may need further non clustered indexes. — GarethD, Aug 24 '15 at 14:39

score 1 · Answer 2 · answered Aug 24 '15 at 12:36

As your UploadDate carries an implicit sorting it could be wise to have a clustered multi-column Primary Key. Any other key will be able to use this as look up key, because the clustered key does implicitly cover all columns. Thus you should create this first...

An additional int key with random numbers should NOT be clustered (it will be extremely fragmented!).

Well, many people say, that (in almost all cases) the PK should not carry any information. This really depends on you needs.

Matt · Answer 3 · 2015-08-24T12:38:03.477

-1

The primary key itself does not speed up queries which do not depend upon it. Other DBMSs may not rely on existence of primary key (e.g. Oracle's default is heap table).

It's just MSSQL "feature" - it "likes" primary key. It was optimized for it (for that tree-structure actually). Other engine may not benefit of simple autoincrement field.

edited Aug 24 '15 at 12:38

answered Aug 24 '15 at 12:36

Matt

13,674
1
18
27

1

If you use a clustered PK it will be the look-up-key for all other indices (beacuse it implicitly covers all columns). But you should never use a clustered key with a column that is not sorted on insert operations! – Shnugo Aug 24 '15 at 12:37
@Shnugo And this makes it useless in 90% cases. Speeding up `select` a little makes `insert` to die. – Matt Aug 24 '15 at 12:49
absolutely not... In one of my projects (very old), I have to deal with CLUSTERED PKs of type UNIQUEIDENTIFIER which are worst in this case. INSERTs don't "die". The index gets a fragmentation of 99,9%... And btw: In everyday use there are fare more reading operations than inserts... – Shnugo Aug 24 '15 at 12:52
@Shnugo I said 90% useless, not 100%. And losing half an hour every day is still better then losing a week once a year. – Matt Aug 24 '15 at 12:59
1

No, I totally do not agree... I don't know what kind of data you are dealing with. Mostly I have to write classical OLTP applications. Speeding up queries is my every day work. Waiting a little longer on a save operation doesn't bother at all! I'm living on well designed indices! A good clustered PK can make things much easier! – Shnugo Aug 24 '15 at 13:02

understanding how a primary key works & how to use it

3 Answers3

Linked