4

I have a database with 30 million rows. The PK clustered index is a code generated GUID.

The table is as follows:

CREATE TABLE [dbo].[events](
    [imageEventGUID] [uniqueidentifier] NOT NULL,
    [imageSHAID] [nvarchar](256) NOT NULL,
    [queryGUID] [uniqueidentifier] NOT NULL,
    [eventType] [int] NOT NULL,
    [eventValue] [nvarchar](2050) NULL,
    [dateOfEvent] [datetime] NOT NULL,
 CONSTRAINT [PK_store_image_event] PRIMARY KEY CLUSTERED 
(
    [imageEventGUID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

Put simply its an image search engine.

  • imageEventGUID is code unique identifier,
  • imageSHAID is the SHA256 of the image URL
  • queryGUID is a code generated FK ( excluded from the create statement for brevity )
  • eventType is a number assigned to what type of event it is
  • eventValue is usually a URI of the image e.g. "http://mywebpage.com/images/image123456789.jpg"

Periodically I insert via SqlBulkCopy (from a DataTable) into this table using pretty standard code:

using (SqlBulkCopy bulk = new SqlBulkCopy(storeConn, SqlBulkCopyOptions.KeepIdentity | SqlBulkCopyOptions.KeepNulls, null))
{
    bulk.DestinationTableName = "[dbo].[events]";
    bulk.WriteToServer(myeventsDataTable);
}

I'm typically trying to insert between 5k and 10k rows in one bulk insert. I'm having terrible insert results from this bulk copy. I used to run this DB on a SSD (only SATA 1 connected) and it was very fast (under 500 ms). I ran out of room on the SSD so I swapped the DB to a 1TB 7200 cache spinning disk, since then completion times are over 120 seconds (120000 MS). When the bulk insert is running I can see disk activity of around 1MB/sec, low CPU usage.

I have no other indexes on this table apart from the PK.

My questions to you are:

Can you see anything obvious that I am doing wrong which would cause this?

Is it just a case of 'your spinning disk is just not fast enough for a DB this size'?

What exactly is happening on the insert of this data? Because it is the clustered index is it re-arranging data pages on disk when an insert is made? It is trying to insert GUIDS which by nature are unordered and so it is possible that this 'random insert nature' is causing the read/write header to move around a lot to different pages on the disk?

Thanks for your time.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
user989056
  • 1,275
  • 2
  • 15
  • 33
  • You use `SqlBulkCopyOptions.KeepIdentity` but you don't have a identity column, why? Also are there other things communicating with the same table (reads or writes)? It could be lock contention. – Scott Chamberlain Aug 08 '13 at 15:19
  • 3
    `GUID` as a clustered primary key in itself is a horribly bad design choice - see Kim Tripp's blog post [GUIDs as PRIMARY KEYs and/or the clustering key](http://www.sqlskills.com/blogs/kimberly/guids-as-primary-keys-andor-the-clustering-key/) for explanations – marc_s Aug 08 '13 at 15:20
  • @ Scott, it was copy and paste fallout from another SqlBulkCopy that I use in code before this. Bad habit, the Keep Identity can be ignored. – user989056 Aug 08 '13 at 15:21
  • @marc_s I Totally agree with marc_s, however if you are forced to keep the guid column to do interop requirements you can set the default value of the PK to [`NEWSEQUENTIALID()`](http://technet.microsoft.com/en-us/library/ms189786.aspx) and that will help mitigate the fragmentation issues it would cause (but you would need to let the insert set the value, you could not pre-set the value before you bulk insert) – Scott Chamberlain Aug 08 '13 at 15:24
  • @marc_s I'm aware of this, unfortunately my system/application does require the generating of PK values on the fly from code. It cannot query the DB to generate an ID for each record or rely on the DB managing it's own PK using identity. This is due to performance issues ( ironically ). I can expand further on this is required. – user989056 Aug 08 '13 at 15:24
  • How often is your clustered index being rebuilt? Do you have a nightly maintenance job? – marc_s Aug 08 '13 at 15:25
  • 3
    Also: you could keep your GUID column as the (non-clustered) **primary key** and introduce a new `INT IDENTITY` column to be used as the clustering key. That alone would already help quite a bit, I'm sure! – marc_s Aug 08 '13 at 15:26
  • @marc_s actually that sounds like a pretty good idea... I'll have a think about that and test it out. I'll get back to you. – user989056 Aug 08 '13 at 15:27
  • @user989056: have a read here, it's worth your time: http://stackoverflow.com/questions/1469674/using-guids-in-primary-keys-clusted-indexes – tommy_o Aug 08 '13 at 17:02
  • Thanks @marc_s as I said in the marked answer comments section, this autoInc cluserted index worked a treat, now getting insert times of < 500ms, I have another related question that you may be able to help me with, it would be appreciated if you could take a look : http://stackoverflow.com/q/18143533/989056 thanks – user989056 Aug 09 '13 at 09:31

4 Answers4

6

My guess is that the main issue is your choice of clustered index. The clustered index determines the physical order or records in the table. Since your PK is a Guid (which I'm assuming are generated randomly rather than sequentially) the database has to insert each row in the proper location, which will likely be between two existing records, which may cause page splits, fragmentation, etc.

As far as why it's faster on an SSD versus a magnetic drive, I'm no expert, but it's likely that the fragmentation process is faster on the SSD due to how it organizes the data. I/O throughput will be faster, but not by that magnitude.

If you can use a numeric autoincrement primary key instead of a GUID, then bulk inserts should be MUCH faster. You can still create unique indices on the GUID column to make queries faster.

D Stanley
  • 149,601
  • 11
  • 178
  • 240
  • In Addition - Especially if you're looking for a faster solution, look into using Sequential Guids in the meantime, although this would still probably require a primary key update, which isn't necessarily simple depending on the size of your system – Sethcran Aug 08 '13 at 15:29
  • Sequential GUIDS aren't an option. @D.Stanley, currently testing this autoincrement Clustered index and GUID non-CI. I'll let you know, thanks. – user989056 Aug 08 '13 at 15:30
  • OP can also increase the fill factor on the clustered index to anticipate the future splits (although this isn't addressing the root cause; it just attempts to mitigate the symptoms). Also a good read: http://stackoverflow.com/questions/1469674/using-guids-in-primary-keys-clusted-indexes – tommy_o Aug 08 '13 at 17:01
  • I tried what you and @marc_s suggested, I have marked this as the answer because marc_s didn't submit an answer, thank you both for your help. On a related note, I have another question which may be of interest to you regarding the same project, a MERGE performance question: http://stackoverflow.com/q/18143533/989056 thanks again. – user989056 Aug 09 '13 at 09:30
1

try to use a default constraint with newsequentialid() on the imageEventGUID column.

It will insert the GUIDs in the correct order, so SQL Server wont have to rearrange the table on each insert

Diego
  • 34,802
  • 21
  • 91
  • 134
  • This would **only work** if the app isn't providing values for that column (which it seems to be doing). If the app already provides a value to the `INSERT` command, then the default constraint won't be used.... – marc_s Aug 08 '13 at 15:29
  • you are right, its a "test" solution to try to identify the problem. if he concludes that this is indeed the problem, he may consider changing the logic that creates the GUIDs, if not he can look somewhere else – Diego Aug 08 '13 at 15:35
1

GUID as a clustered primary key in itself is a horribly bad design choice - see Kim Tripp's blog post GUIDs as PRIMARY KEYs and/or the clustering key for explanations. Using a random (client-side generated) GUID will lead to very high (often 99% or more) fragmentation, and in the process of bulk inserting a lot of rows, it will cause tons of page splits which are very expensive operations.

If you can't change that - you can at least make sure that clustered index which will have horrible fragmentation values is being rebuilt every night - or even more frequently, if you can afford to.

And you could also keep your GUID column as the (non-clustered) primary key and introduce a new INT IDENTITY column to be used as the clustering key. That alone would already help quite a bit, I'm sure, by eliminating the outrageous fragmentation that the very random GUIDs will cause on your clustered index.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
0

You can disable other indexes but not a clustered PK.
Well can disable a clustered PK but that disables the table.
If the data is not loading in the order of the PK then you will get rapid index fragmentation.
As fragmentation increases then insert speed decreases.

Understand you cannot control the GUID

But a few options.

Use a fill factor on [PK_store_image_event] of like 50, 20, or 10
This leaves space for inserts but at the cost of a larger index size on disk
Periodically rebuild the index - minimum nightly .

Can you sort the data prior to the load?
If so load sorted by the PK.
If you have the data in a DataTable then you can sort it.
You won't be to use your existing load code but you can sort it.
TVP is an option.

Use an iden for the PK and unique index on [imageEventGUID].
If it has a unique index it can be a FK.
Disable that index, load, then rebuild.
The rebuild will fail if you have a duplicate.

Or as a variation of the above just skip the iden PK.

paparazzo
  • 44,497
  • 23
  • 105
  • 176