What primary key strategy would be best to use for a relational database model given the following?
- tens of thousands of users
- multiple clients per user (phone, tablet, desktop)
- millions of rows per table (continually growing)
Azure SQL will be the central data store which will be exposed via Web API. The clients will include a web application and a number of native apps including iOS, Android, Mac, Windows 8, etc. The web application will require an “always on” connection and will not have a local data store but will instead retrieve and update via the api - think CRUD via RESTful API.
All other clients (phone, tablet, desktop) will have a local db (SQLite). On first use of this type of client the user must authenticate and sync. Once authenticated and synced, these clients can operate in an offline mode (creating, deleting and updating records in the local SQLite db). These changes will eventually sync with the Azure backend.
The distributed nature of the databases leaves us with a primary key problem and the reason for asking this question.
Here is what we have considered thus far:
GUID
Each client creates it’s own keys. On sync, there is a very small chance for a duplicate key but we would need to account for it by writing functionality into each client to update all relationships with a new key. GUIDs are big and when multiple foreign keys per table are considered, storage may become an issue over time. Likely the biggest problem is the random nature of GUIDs which means that they can not (or should not) be used as the clustered index due to fragmentation. This means we would need to create a clustered index (perhaps arbitrary) for each table.
Identity
Each client creates it’s own primary keys. On sync, these keys are replaced with server generated keys. This adds additional complexity to the syncing process and forces each client to “fix” their keys including all foreign keys on related tables.
Composite
Each client is assigned a client id on first sync. This client id is used in conjunction with a local auto-incrementing id as a composite primary key for each table. This composite key will be unique so there should be no conflicts on sync but it does mean that most tables will require a composite primary key. Performance and query complexity is the concern here.
HiLo (Merged Composite)
Like the composite approach, each client is assigned a client id (int32) on the first sync The client id is merged with a unique local id (int32) into a single column to make an application wide unique id (int64). This should result in no conflicts during sync. While there is more order to these keys vs GUIDs since the ids generated by each client are sequential, there will be thousands of unique client-ids, so do we still run the risk of fragmentation on our clustered index?
Are we overlooking something? Are there any other approaches worth investigating? A discussion of the pros and cons of each approach would be quite helpful.