0

Thanks to the wonderful article The Cost of GUIDs as Primary Keys, we have the COMB GUID. Based on current implementation, there are 2 approaches:

  1. use last 6 bytes for timestamp: GUIDs as fast primary keys under multiple databases
  2. use last 8 bytes for timestamp by using windows tick: GUID COMB strategy in EF4.1 (CodeFirst)

We all know that for 6 bytes timestamp at GUID, there would more bytes for random bytes to reduce the collision of the GUID. However more GUID with same timestamp would be created and those are not sequential at all. With that, 8 bytes timestamp would be preferred.

So it seems a hard choice. Based on article above GUIDs as fast primary keys under multiple databases, it says:

Before we continue, a short footnote about this approach: using a 1-millisecond-resolution timestamp means that GUIDs generated very close together might have the same timestamp value, and so will not be sequential. This might be a common occurrence for some applications, and in fact I experimented with some alternate approaches, such as using a higher-resolution timer such as System.Diagnostics.Stopwatch, or combining the timestamp with a "counter" that would guarantee the sequence continued until the timestamp updated. However, during testing I found that this made no discernible difference at all, even when dozens or even hundreds of GUIDs were being generated within the same one-millisecond window. This is consistent with what Jimmy Nilsson encountered during his testing with COMBs as well

Just wonder if someone who knows database internal could share some lights about above observation. Is it because that database server just store the data in the memory and only write to disk when it reaches certain threshold? Thus the reorder of inserted data with non sequence GUID with same time stamp would happen in general in memory and thus minimal performance penalty.

Update: Based on our testing, the COMB GUID could not reduce the table fragmentation as it is claimed over the internet compared with random GUID. It seems the only way right now is to use SQL Server to generate the sequential GUID.

Community
  • 1
  • 1
windfly2006
  • 1,703
  • 3
  • 25
  • 48
  • 1
    I think all of the articles listed are confusing *primary key* with *clustered index key*. GUIDS work well as primary keys, especially in multi-master situations, but don't work well as clustered index keys (though "doesn't work well" depends on what other columns there are in the table). – Greenstone Walker Mar 10 '14 at 21:25
  • yes, you are right. Our primary concern is due to that it would create a lot of fragmentation due to randomness of GUID which is clustered PK at our table. Any thoughts about the my questions above for performance of random GUID within same timestamp? – windfly2006 Mar 12 '14 at 17:16
  • If there is no other column to use as the clustered index key then I'd go for `newsequentialid()` (from @ErikE below). – Greenstone Walker Mar 13 '14 at 21:11

1 Answers1

2

The article you referenced is from 2002 and is very old. Just use newsequentialid (available in SQL Server 2005 and up). This guarantees that each new id you generate is greater than the previous one, solving the index fragmentation/page split issue.

Another aspect I'd like to mention, though, that the writer of that article glossed over, is that using 16 bytes when you only need 4 is not a good idea. Let's say you have a table with 500,000 rows averaging 150 bytes not including the clustered column, and the table has 3 nonclustered indexes (which repeat the clustered column in each row), each in turn with rows averaging 4 bytes, 25 bytes, and 50 bytes not counting the clustered column.

The storage requirements at perfect 100% fill factor are then (all numbers in megabytes except where %):

Item  Clust  50     25     4      Total
----  -----  -----  -----  -----  ------
GUID  79.1   31.5   19.6    9.5   139.7
 int  73.4   25.7   13.8    3.8   116.7
%imp   7.2%  18.4%  29.6%  60.0%   16.5%

In the nonclustered index having just one int column of 4 bytes (a common scenario), switching the clustered index to an int makes it 60% smaller! This translates directly into a 60% performance improvement for any scans on the table--and that's conservative, because with smaller rows, page splits will occur less often and the fragmentation will stay better longer.

Even in the clustered index itself, there's still a 7.2% performance improvement, which is not nothing, at all.

What if you used GUIDs throughout your entire database, which had tables with a similar profile as this where switching to int would yield a 16.5% reduction in size, and the database itself was 1.397 Terabytes in size? Your whole database would be 230 Gb larger (refer to the Total column, 139.7 - 116.7). That translates into real money in the real world for high-availability storage. It moves your disk purchase schedule earlier in time which is harmful to your company's bottom line.

Do not use larger data types than necessary, ever. It's like adding weight to your car for no reason: you will pay for it (if not in speed, then in fuel economy).

UPDATE

Now that I know you are creating the GUID in your client-side code, I can see more clearly the nature of your problem. If you are able to defer creating the GUID until row insertion time, here's one way to accomplish that.

First, set a default for your CustomerID column:

ALTER TABLE dbo.Customer ADD CONSTRAINT DF_Customer_CustomerID
   DEFAULT (newsequentialid()) FOR Customer;

Now you don't have to specify what value to insert for CustomerID in any INSERT, and your query could look like this:

DECLARE @Name varchar(100) = 'Acme Spy Devices';
INSERT dbo.Customer (Name)
OUTPUT inserted.CustomerID -- a GUID
VALUES (@Name);

In this very simple example, you have inserted a new row to the Customer table, and returned a rowset to the client containing the just-created value, all in one query.

If you wanted to explicitly insert VALUES (newsequentialid(), @Name) that would work, too.

ErikE
  • 48,881
  • 23
  • 151
  • 196
  • thanks very much. We are using the Entity Framework and we are using the code to generate the GUID for PK, so we are looking for some minimal code change to generate the sequential GUIDs at code. At this point, we won't be able to go back to INT as PK due to too much code change. If we are using newsequentialid, there would be an extra database round trip for generating that GUID, right? – windfly2006 Mar 07 '14 at 21:10
  • Correct, `newsequentialid()` would have to be called from the database. Sometimes code can be rewritten to get the ID *after* inserting the entire record instead of having to get the ID before--if so, no extra trip to the database. – ErikE Mar 07 '14 at 21:42
  • Thanks again. Just wonder if you could share some link or code sample for the case when extra trip to database is not required. – windfly2006 Mar 10 '14 at 18:02
  • Your disk space calculation is not accounting for slack space. – Greenstone Walker Mar 10 '14 at 21:27
  • Yes, @GreenstoneWalker, it was intentional to not give an exact accounting of actual space usage (note I said "perfect 100% fill factor"), it was simply to show a reasonable comparison between the two column's space requirements. How would you anticipate that "accounting for slack space" would affect things? Do you think it would materially affect my assertion that using 16 bytes where 4 will do is a poor design choice? – ErikE Mar 10 '14 at 22:26