1

I'm new to data warehouse, So I try to follow the best practice, mimicking some implementation details from the Microsoft Demo DB WideWorldImportersDW, One of the things that I have noticed is using Sequence as default value for PK over Identity.


Could I ask, If it's preferable to use Sequence over Identity in data warehouse in general and Which one is more convenient especially during ETL process?.

Anyname Donotcare
  • 11,113
  • 66
  • 219
  • 392

3 Answers3

4

A sequence has more guarantees than an identity column. In particular, each call to a sequence is guaranteed to produce the next value for the sequence.

However, an identity column can have gaps and other inconsistencies. This is all documented here.

Because of the additional guarantees on sequences, I suspect that they are slower. In particular, I suspect that the database cannot preallocate values in batch. That means that in a multi-threaded environments, sequences would impose serialization on transactions, slowing things down.

In general, I see identity used for identifying columns in tables. And although there is probably a performance comparison, I haven't seen one. But I suspect that sequences are a wee bit slower in some circumstances.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Could I ask what do you meant by ‘guarantee’ in this context? And if I understand the answer correctly then You prefer using identity over sequence in data Warehouse? – Anyname Donotcare Jan 10 '21 at 14:39
  • @AnynameDonotcare . . . My understanding is that a sequence is always guaranteed to return the next value, with no gaps or duplicates. Identity columns do not have that guarantee -- and hence probably have better performance. Yes, I prefer `identity` for large tables. – Gordon Linoff Jan 10 '21 at 18:59
  • Sequence can produce gaps as well, otherwise it could be abused by a *malicious* user taking a lanch after getting the `nextval` without a *commit* letting all other waitng until he is ready;) – Marmite Bomber Jan 10 '21 at 21:23
  • @GordonLinoff But https://learn.microsoft.com/en-us/sql/t-sql/statements/create-sequence-transact-sql?view=sql-server-ver15#cache-management describes both the cache management and that there may be gaps (if the engine is stopped abnormally)? The only guarantee explicitly stated is that the same value will not be issued again unless `cycle` is used. – GSerg Jan 10 '21 at 21:26
  • @MarmiteBomber [*Sequence numbers are generated outside the scope of the current transaction.*](https://learn.microsoft.com/en-us/sql/t-sql/statements/create-sequence-transact-sql?view=sql-server-ver15#general-remarks) – GSerg Jan 10 '21 at 21:27
  • For sure @GSerg this is *the reason* why they can produce gaps. I only used a negative formulation. Sorry if it was confusing. – Marmite Bomber Jan 10 '21 at 21:44
3

Both Sequence and Identity are designed for OLTP tables to enable effective assignment of unique keys in multi-session environment.

Important thing to realize is that in data warehouse environment you often have a different setup and there is only one job that populates a specific table.

In a single user environment you do not need the above features at all and you can simple assign the keys manually starting with max(id) +1 and increment by one for each row.

The general rule of data warehouse is that you should not search for silver bullet recommendation but check the functionality and preformance in your onw test.

If you make some research on SQL Server Identity vs Sequence e.g. here or here you get various result partly prefering the former partly the latter feature.

My recomendation is therefore to perform a test with the manually assigned IDs (i.e. with no overhead) simple to get a baseline for the expectation.

Than repeat it with both identity and sequence - compare and choose.

The sequence in SQL Server was added later and is based on Oracle Sequence, so I would not expect it has some basic problem.

The experience from Oracle tells us, you need to have a large enought cache in the sequence to support effective bulk insert.

In the meantime the identity can also be defined as cached, (IDENTITY_CACHE = { ON | OFF }) so once again, try all three posibilities (sequence, identity, nothing) and choose the best one.

Marmite Bomber
  • 19,886
  • 4
  • 26
  • 53
1

Identity is scoped to a single table, is part of the table definition (DDL) and is reset on a truncate. Identity is unique within the table. Each table has its own identity value when configured and cannot be shared across tables. In general usage, the "next" value is consumed by SQL Server when an Insert occurs on the table.+

Sequence is a first class object, scoped to the database. The "next" value is consumed when the Sequence is used (NEXT VALUE FOR).

Sequences are most effectively used when you need a person readable unique identifier stored across multiple tables. For example a ticketing system that stores ticket types in different tables may use a sequence to ensure no ticket receives the same number, regardless of the table in which it is stored, and that a person can reasonably refer to the number (not GUID).

In data warehousing, the dimension table needs a row identifier unique within the table. In general, the OLTP primary key is not sufficient as it may be duplicated within the dimension table depending on the type of dimension, and you don't want to risk assigning additional context to the OLTP PK as that can cause challenges when the source data changes. The dimension row identifier should only have meaning to the non-measure fact columns associated with it. Fact columns are not joined across different dimensions.++

Since the scope of the dimension table identifier is limited to the dimension table, an identity key is the ideal row identifier. It is simple to create, compact to store, and is meaningless outside the dimension. You won't use the dimension identity on a report. (Really, please don't be that developer.)

+ Its rare you'll need to know the next value without needing to assign to a row. Might be a red flag if you are trying to manipulate the identity value prior to assignment

++ a dimension view may union different tables to feed the OLAP cube, in which case a persistent repeatable key should be generated from the underlying data, usually by concatenating a string literal with each table key in a normalized format.

Wes H
  • 4,186
  • 2
  • 13
  • 24