0

I am am trying to construct a database in SQL Server 2008 R2 that will allow users to place their own sub-types into categories. I have a parent table that holds the preset category names (defined by me).

The question I face is what is what is the best way to deal with the PRIMARY KEY and UNIQUE constraint, and foreign key REFERENCES. Indexing is at the center of this as I anticipate that the sub table (we will call it CategoryTypes) will grow quite large over time and will need to be able to efficiently allow reads from the data based on the parent table (Categories). Is there any problem I would need to anticipate if the tables were laid out as follows?

My concern is that the IDENTITY column in the CategoryTypes table will need to maintain a unique count. The reason I have included this field is to allow a simpler reference when passing data between tiers in the application. By passing an Integer versus an Integer / String pair. The data in these tables will persist at each layer of the database to save on bandwidth. From a database perspective, does the layout below pose any major challenges once deployed?

To simplify, is there a problem with using a unique ID field (IDENTITY) that is not included in the primary key when a composite key is present? See table layout below:

Parent Table:

CREATE TABLE schema.Categories
(
  Id TINYINT PRIMARY KEY NOT NULL,
  Name VARCHAR(100) NOT NULL,
)

Sub Table (User inserted data over time):

CREATE TABLE schema.CategoryTypes
(
   Id INT IDENTITY(1,1) NOT NULL,
   CategoryId TINYINT REFERENCES schema.Categories(Id) NOT NULL,
   Name VARCHAR(100) NOT NULL,
   CONSTRAINT PRIMARY KEY CLUSTERED(CategoryId, Name)
   CONSTRAINT UC_CategoryTypesId UNIQUE NONCLUSTERED(Id)
)
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Zack Jannsen
  • 622
  • 7
  • 17
  • PS - I would also add that the Id in the sub-table "CategoryTypes" will be referenced by other tables as via foreign key relationships ... so it is necessary. I am open to any best practices that address such complexities in table design. – Zack Jannsen Mar 03 '12 at 20:03
  • 1
    Please beware that secondary indexes are expensive in clustered tables. If you don't really need the surrogate key `{Id}`, then don't use it at all and just use the natural key `{CategoryId, Name}`. If you want both keys, use a heap-based table (i.e. `PRIMARY KEY NONCLUSTERED`). – Branko Dimitrijevic Mar 04 '12 at 02:14
  • Branko, I am aware of the extra overhead, but thank you for reiterating this in this context. The challenge I face is I know that I will need to reference the surrogate Id (foreign key). I suppose you can do a composite foreign key ... but I am not so familiar with this. The surrogate seems to make things a lot easier. That said, I could start by eliminating "CLUSTERD" indexing in favor of "NONCLUSTERED" as both you and Justin recommend. If performance is needed later I can always address after I have some production data. In CRUD - Clustered indexes help "read", but can put overhead. – Zack Jannsen Mar 04 '12 at 17:27
  • Also, I will be doing a lot with stored procedures. The surrogate would make writing these PROCS much simpler. I will go with the Heap table as recommended unless either you or Justin feel there is more cause to explore compound foreign keys. – Zack Jannsen Mar 04 '12 at 17:30
  • A foreign key can reference any key, whether it is primary or alternate, surrogate or natural, single-field or composite. I simply wanted to warn you that if you use the second key, this implies a secondary index and _any_ secondary index in the clustered table (including the index "under" the key) incurs a certain penalty. There are legitimate reasons why you would want to introduce a surrogate key beside the existing natural key; complications this might bring in the context of clustering is simply _one_ of the reasons against it. – Branko Dimitrijevic Mar 04 '12 at 18:40
  • Also, beware that the extra "covering" that the secondary index in a clustered table provides, while wasting space, might actually increase the performance of _some_ queries. For example: `SELECT CategoryId, Name FROM Categories WHERE Id = @id` could completely be [covered](http://use-the-index-luke.com/sql/glossary/covering-index) by the index "underneath" the `UNIQUE(Id)`. At the end of the day, do your own measurements of **your own queries** on representative amounts of data before arriving at a decision. – Branko Dimitrijevic Mar 04 '12 at 18:48
  • Branko, Excellent points again. Thank you. – Zack Jannsen Mar 04 '12 at 19:38
  • As always, the key takeaway here - "Choice is subject to test data & production performance". Thanks guys. – Zack Jannsen Mar 04 '12 at 19:47
  • Just letting you know that [one of your other questions got migrated to Meta](http://meta.stackexchange.com/questions/126519/i-lost-my-prevous-account-on-stackoverflow-due-to-the-forced-social-login) in case you didn't notice. – Shadow The GPT Wizard Mar 26 '12 at 14:10

1 Answers1

0

What you are describing sounds kind of like an inheritance structure. I have created an example dataset as far as I understand it. Can you verify this is your intent?

If it is, then this should work fine, and I do not see why you are not setting the CategoryType.Id as the primary key? If it is not your PK, nor being referenced as a FK elsewhere, then I dont see a point to it. I personally dont think you gain enough in bandwidth savings, and should probably just request the data by CategoryId and Name. In fact, no PK is often how inheritance structures are represented ( How can you represent inheritance in a database? ).

If you must keep it the way that you have it set up, I personally suggest setting the Id as the PK, and just setting up a unique constraint on CategoryId/Name.

That is just my two cents, though.

Category
----
Id|Name
1 |Food
2 |Drink

CategoryType
----
Id|CategoryId|Name
1 |2         |Water
2 |2         |Orange Juice

UPDATED ANSWER (to directly address performance concerns)

First, I would suggest not entirely worrying about it too much if it is not a problem. That is a common problem many of us make, overcomplicating something that does not need it. That falls under the KISS principle in my book

However, if you are deadset on trying to figure this out ahead of time the way you explained, then here are my additional thoughts:

  • Create the PK as Id, however make it NONCLUSTERED
  • Create the Clustered Index on CategoryId and consider using an INCLUDE keyword on the above nonclustered index.
  • Only do the above if you are more often going to be using the CategoryId for querying, than the CategoryType.Id
  • A consideration when creating your keys though (even taken from the INCLUDE article)
Index maintenance may increase the time that it takes to perform modifications
, inserts, updates, or deletes, to the underlying table or indexed view.

Ultimately, I think what you are doing will be fine, however a PK does not have to be clustered, so I would definitely move the PK to the Id field. It is your choice if you want to make the cluster on CategoryId or CategoryId/Name, or if you want to try using the INCLUDE as I suggested. This really will depend on how the tables are being used, so comparing execution plans might help here.

Hopefully this helps :)

Community
  • 1
  • 1
Justin Pihony
  • 66,056
  • 18
  • 147
  • 180
  • Justin, to answer your question "Is this an Inheritance Structure" - YES. The example you laid out perfectly parallels what I am attempting to do. You comment on making the Id the PK is what my current structure in SQL has implemented. The reason I posted this in reverse is that the Index (when you have, for example, a million CategoryTypes, would it not benefit to have the clustered Index based on the composite relationship? I agree, PK does seem to make the most logical sense here as Id ... but given that PK is much the same as a Unique Clustered key I wonder which way is best for perf. – Zack Jannsen Mar 03 '12 at 20:35
  • Edited my response to focus it a little more on your more specific questions. – Justin Pihony Mar 04 '12 at 03:02