8

I've been having this dilemma for a while and couldn't find any hints to it, although it seems that someone outha have done it already.

What I need is to replace sequential AUTO_INCREMENT (or equivalent) primary keys with criptographically secure (i.e. non-consecutive!) ids, but at the same time I want to keep the performance advantage of sequential PKs: guaranteed unused next ID, clusterability, etc.

A simple approach would seem to implement a cryptographic pseudo-random permutation generator to uniquely map the 2^N space to 2^N without collisions and with an initialisation vector (IV).

While this could be implemented externally, this does need to store and atomically access state (the permutation position or last id), which means implementing externally would be grossly inefficient (it's the equivalent of running a subsequent UPDATE table SET crypto_id = FN_CRYPTO(autoincrement_id) WHERE autoincrement_id=LAST_INSERT_ID() for every INSERT).

Do you know of any such implementation as described above in a database in commercial use?

Dinu
  • 1,374
  • 8
  • 21
  • why dont you just use uniqueidentifier? is this not secure enough? see example B here in the link below: – DaFi4 Jul 20 '19 at 02:27
  • https://learn.microsoft.com/en-us/sql/t-sql/functions/newid-transact-sql?view=sql-server-2017 – DaFi4 Jul 20 '19 at 02:27
  • @DaFi4 Thanks for that, I didn't know that existed; there seem to be however a few quirks with that: 1) It's a long alphanumeric sausage, much slower to deal with than a numeric ID PK. 2) Reading RFC 4122, there are 2 implementations for the UUIDS: one is random (which will generate collisions), and the other is the timestamp, in 0.1us increments, in clear (not hashed or anything). So not secure enough is understated... not secure at all. It keeps ids in increasing order at close enough distances. Also the MS doc doesn't state which algorithm it is. 3) Does this scale to distributed DBs? – Dinu Jul 20 '19 at 07:17
  • 4) How to get the last inserted ID? – Dinu Jul 20 '19 at 07:20
  • 1. its not really slower, its designed to be fast 2. its designed so that collisions do not occur. ive never seen one in 22 years. 3. yes, in fact, a primary function of using this approach is to solve problems that arise from distributed DB and data – DaFi4 Jul 20 '19 at 11:13
  • two ways to get last id: https://stackoverflow.com/questions/1509947/scope-identity-for-guids/1510529 – DaFi4 Jul 20 '19 at 11:15
  • performance: https://www.mssqltips.com/sqlservertip/5105/sql-server-performance-comparison-int-versus-guid/ – DaFi4 Jul 20 '19 at 11:15
  • collisions: https://stackoverflow.com/questions/184869/are-guid-collisions-possible – DaFi4 Jul 20 '19 at 11:16
  • @DaFi4 Thanks, but I can't find any paper describing and committing to a purpose and implementation for these GUIDs, or the underlying random generator etc (see Behrooz comment under collisions)... so I find relying on hearsay a bit risky for my taste when it comes to security. Other DBs implemented GUIDs seem just as inconsistent, too. – Dinu Jul 20 '19 at 15:55
  • Hi Dinu, I dont understand all the comments above. Just ask, its pretty easy to find: its rfc 4122 https://tools.ietf.org/html/rfc4122 https://learn.microsoft.com/en-us/sql/t-sql/functions/newid-transact-sql?view=sql-server-2017 – DaFi4 Jul 22 '19 at 04:10
  • here you can read up on the probability of a collision and also use-cases, as well as other references to other papers on the subject such as "2002 Jimmy Nilsson": https://en.wikipedia.org/wiki/Universally_unique_identifier – DaFi4 Jul 22 '19 at 04:17
  • NEWID seems to be a wrapper of CoCreateGuid, at least as of Yukon: https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/23/newsequentialid-histrorybenefits-and-implementation/ – DaFi4 Jul 22 '19 at 04:42
  • and CoCreateGuid calls uuidcreate https://learn.microsoft.com/en-us/windows/win32/api/rpcdce/nf-rpcdce-uuidcreate – DaFi4 Jul 22 '19 at 04:47
  • @DaFi4 : Really, I mean an implementation paper containing a stated purpose of cryptographic resistance for the UUIDs for now and in future, from the DB provider or in the SQL standard, current-state details of the particular algorithm used for generating the UUIDs, a trackable reference to the pseudo-random generator used and initialisation sequence. I don't mean to be a smart-ass, but when the word "cryptographic security" is written it's the minimum expected. Otherwise, the fact that in theory it might be secure does not cut it. Thank you for the effort, but I will not be using this. – Dinu Jul 22 '19 at 04:49
  • Just to give you a hint of how sensitive this is: if one was to use, say a Mersenne Twister PRNG: https://en.wikipedia.org/wiki/Mersenne_Twister "observing a sufficient number of iterations (624 in the case of MT19937, since this is the size of the state vector from which future iterations are produced) allows one to predict all future iterations.". So while this would have a good theoretical collision resistence, if a user inserts 624 records into the system and gets their IDs, he can know all future (and maybe past) IDs. – Dinu Jul 22 '19 at 05:41
  • Actually 624/4=158, since its output is 32 bits so to make 128 bits one runs 4 iterations :) – Dinu Jul 22 '19 at 05:48
  • no worries and thanks for explaining – DaFi4 Jul 24 '19 at 14:27

2 Answers2

1

While this could be implemented externally, this does need to store and atomically access state (the permutation position or last id), which means implementing externally would be grossly inefficient (it's the equivalent of running a subsequent

 UPDATE table SET crypto_id = FN_CRYPTO(autoincrement_id) 
 WHERE autoincrement_id=LAST_INSERT_ID()

You could use generated/virtual column to avoid running proposed UPDATE for every insert:

-- pseudocode
CREATE TABLE tab(
   autoincrement_id INT AUTO_INCREMENT,
   crypto_id <type> GENERATED ALWAYS AS (FN_CRYPTO(autoincrement_id)) STORED
);

-- SQL Server example, SHA function is an example and should be replaced
CREATE TABLE tab(
 autoincrement_id INT IDENTITY(1,1),
 crypto_id AS (HASHBYTES('SHA2_256',CAST(autoincrement_id AS NVARCHAR(MAX))))     PERSISTED
);

db<>fiddle demo


More info:


EDIT by Dinu

If you use SHA, don't forget to concatenate a secret salt to the autoincrement_id; alternately, you could use i.e. AES128 to encrypt the autoincrement_id with a secret password and IV.

Also worth noting: any DB user with access to the table DDL will have access to your secret salt/key/iv. If this is of concern to you, you can use a parameterized stored procedure i.e. FN_CRYPTO(id,key,iv) instead and send them along with every insert.

To retrieve the crypto_id on the app-side without needing a subsequent query, you would need to replicate the encryption function app-side to run on the returned autoincrement_id. Note: if using autoincrement_id as byte array for AES128, be very careful about endianness, it may differ DB and app-side. The only alternative is to use the OUTPUT syntax of mssql, but that is specific to mssql and it requires running the ExecuteScalar API instead of ExecuteNonQuery.

Community
  • 1
  • 1
Lukasz Szozda
  • 162,964
  • 23
  • 234
  • 275
  • Thanks, this looks on the right track, but how to get the generated crypto_id as a result of an `INSERT`? Do note we are using an ORM app-side so masquerading a stored procedure as an `INSERT` wrapper would be a terrible idea as it would mean massive code changes. – Dinu Jul 20 '19 at 15:16
  • Maybe replicate the generation function app-side based on the identity id would work to get the crypto ID? I would use AES128 instead of hasing as it would guarantee no collision and I see it implemented elsewhere, but it's doable with most RDBs. – Dinu Jul 20 '19 at 15:30
  • @Dinu `how to get the generated crypto_id as a result of an INSERT` You could create a user-defined function in your RDBMS and set as calculated column expression. – Lukasz Szozda Jul 21 '19 at 07:43
  • No, I mean I need the generated crypto_id following the `INSERT` back in the app layer. Now, most RDBs only return the `AUTO_INCREMENT` / `IDENTITY` value. I see MS SQL has some OUTPUT syntax to return some other value, but it needs rewriting the entire ORM to use ExecuteScalar. Not very convenient. – Dinu Jul 21 '19 at 17:56
  • @Dinu `SELECT crypto_id FROM tab WHERE id =?` It is still normal column which you could easily access like any other column – Lukasz Szozda Jul 21 '19 at 17:57
  • I'm really aiming for performance here... If I need to run 2 queries, I could just run the INSERT / UPDATE sequence and bother less about it :) Duplicating the ID generation function in the app-side and re-computing app-side the crypto_id from the auto increment ID which is available is the only solution I could find, unless you know a more elegent one... – Dinu Jul 21 '19 at 17:58
  • 1
    I think I know the answer to this: use OUTPUT combined WITH SCHEMABINDING : https://stackoverflow.com/questions/6354894/get-sql-computed-column-inserted-value – DaFi4 Jul 24 '19 at 14:24
  • (also be sure to benchmark the performance on this answer, if I recall correctly, the more advanced encryption functions perform slow at scale) – DaFi4 Jul 24 '19 at 14:27
  • Well, like I said, the issue with `OUTPUT` is that it requires the `ExecuteScalar` API instead of `ExecuteNonQuery`... we do use an ORM/DBAL on top of the DB API so this would pretty much require dropping or rewriting the ORM/DBAL (the same is true for using a stored procedure). So we have to stick to regular INSERT/UPDATE/SELECT operations. With regards to the crypto functions, SHAx / AESx should run extremely fast for a single block (except some botched implementation); using a CSPRNG (cryptographical random) is VERY slow, and using a regular PRNG is generally insecure. – Dinu Jul 26 '19 at 14:45
  • 1
    @Lukasz Szozda I accepted the answer, I couldn't find any better implementation as much as I researched. I also proposed an edit to add more info about SHA salt, AES alternative and how to retrieve the crypto id app-side. – Dinu Aug 02 '19 at 06:26
-1

Just a thought... Is the DB itself secure? If so, you might consider a "key pool" table that holds a list of pseudo-random keys and a "status" column for each key in the table. Then you could assign the next key when required. The key pool can get populated during idle times and/or based on a trigger if the available keys drops below a set threshold.

Again, this method would depend on being able to secure the key pool table, but it would assure that the keys assigned would be random and unique.

Also, you would need to be sure that you don't create concurrency issues, but this could be done with a stored procedure, and still should be faster than generating the secure IDs on demand.

daShier
  • 2,056
  • 2
  • 8
  • 14
  • 1
    No, searching a "key pool" then updating said key pool is not faster than generating a crypto id on-the-fly. You are suggesting something even more inefficient than what I alreadly listed as unacceptably inefficient (INSERT/UPDATE). – Dinu Jul 26 '19 at 14:40
  • I agree that on a whole, the key pool is less efficient. But I was thinking that the pool could be repopulated during idle times, so that the overhead is distributed to less critical time periods. Of course, I don't know the specifics of your application, so I understand that this might not be a practical option for you. – daShier Jul 26 '19 at 14:56
  • The key pool still needs to be updated after you pull each ID, to mark it gone. There is absolutely no reason to create the key pool, key generation is not slow. Database operation to do repeated search/insert/updates with corresponding mutexes is slow. – Dinu Jul 26 '19 at 15:01