2

I have a database of TCG cards and I am trying to decide on a primary key. I initially settled on a surrogate key, but then I realized that sometimes there are cards I forget to add, like promo cards for example. This is problematic with the surrogate key because they get added to the database with the latest auto increment and I didn't want their IDs to depend on the order they were inserted. I was thinking maybe I could hash some of the card features and use that as the primary key instead?

Take, for example, the following PHP snippet:

// set code, date released, collector number, name
$crc = crc32(implode(',', ['A', '1993-08-03', '232a', 'black lotus']));
echo $crc; // 4199975187

The number of possible cards hovers just around 25k now and grows by around 100-300 every 6 months.

  1. At this rate, there won't be a collision right?
  2. Is this good practice? Do I have any other good alternatives?

I know I can make the hash shorter by converting it to base 62, but since I will be joining these against the users' inventories table, I think keeping them as plain integers is the best option.

voldomazta
  • In general, an "intelligent key" (such as a key generated from concatenated columns) is a _bad_ idea. Could you explain why a `surrogate key` does not solve your problem? A key is just a unique number; why does it matter if its value also happens to correspond to the order of insertion? – hashbrown Mar 25 '15 at 04:06
  • That `ID` is going to be visible to the users viewing the app, most likely in a $_GET parameter. I did not want them to guess the order the cards were inserted into the database based on the IDs they had. – voldomazta Mar 25 '15 at 04:11
  • Why does the order of insert matter to the end user? It's not like you are exposing table names or actual secrets. Most web applications utilize identities for keys. A large percentage expose those keys somewhere. It's not a big deal. – Tim Mar 25 '15 at 04:32

2 Answers

3

I take exception to this:

This is problematic with the surrogate key because they get added to the database with the latest auto increment and I didn't want their IDs to depend on the order they were inserted.

ID (correctly "Id", as it's an abbreviation rather than an initialism) is short for "identity", which has a single property: it is unique for each element, i.e. it is used to identify each element. You should attach no other connotation to it, so the fact that it monotonically increases with insertion order is irrelevant, and sorting data by a generated identity column is meaningless (unless it's inside an index used for lookup-by-Id). Treat Ids as opaque handles.

Certainly, if you use a digest (such as CRC-32), the sort order of Ids is meaningless, but that actually gives you less utility than a monotonically increasing Id.

You correctly identified the risk of a hash collision: the range space of CRC-32 is only 32 bits, and if you have 25,000 cards then the Birthday Problem tells us the odds of a collision in such a small range space are not insignificant.
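To put a rough number on that, the standard birthday approximation p ≈ 1 − e^(−n²/2N) can be sketched in a few lines of PHP (the 25,000 figure comes from the question):

```php
<?php
// Birthday-problem approximation of collision probability: p ≈ 1 - e^(-n^2 / 2N)
$n = 25000;      // number of cards (from the question)
$N = 2 ** 32;    // range space of CRC-32
$p = 1 - exp(-($n * $n) / (2 * $N));
printf("%.1f%%\n", $p * 100); // ≈ 7.0% chance of at least one collision
```

Roughly a 1-in-14 chance of a collision already, and it grows quadratically with the number of cards.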

Just use the auto-generated Id value. :)

Using a computed hash as an identifier / key does have some utility — this is how hash-tables work — in that it lets you quickly find something by value without scanning the whole table. For example, to find the Black Lotus card, take the hash of its properties as you do, then look up the computed hash in the ID column instead of doing SELECT ... WHERE code = 'A' AND ... AND name = 'black lotus'. But it requires you to know every property value first, and if you set up the right table indexes this advantage quickly becomes moot.
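As a sketch of that last point (the table and column names below are invented for illustration, using an in-memory SQLite database): a surrogate key plus a unique index over the natural columns gives you fast lookup-by-value without any hash at all.

```php
<?php
// Sketch: auto-generated surrogate key + unique index over the natural columns.
// Table/column names are hypothetical; in-memory SQLite is used for demonstration.
$pdo = new PDO('sqlite::memory:');
$pdo->exec('CREATE TABLE cards (
    id           INTEGER PRIMARY KEY AUTOINCREMENT, -- opaque surrogate key
    set_code     TEXT NOT NULL,
    released     TEXT NOT NULL,
    collector_no TEXT NOT NULL,
    name         TEXT NOT NULL,
    UNIQUE (set_code, released, collector_no, name) -- natural uniqueness, indexed
)');
$pdo->prepare('INSERT INTO cards (set_code, released, collector_no, name)
               VALUES (?, ?, ?, ?)')
    ->execute(['A', '1993-08-03', '232a', 'black lotus']);

// Lookup by value uses the index -- no computed hash required.
$stmt = $pdo->prepare('SELECT id FROM cards WHERE set_code = ? AND name = ?');
$stmt->execute(['A', 'black lotus']);
echo $stmt->fetchColumn(); // 1
```

The surrogate Id stays opaque; uniqueness and fast value lookups are the index's job, not the key's.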

The main problems with using a hash as a primary key are:

  1. Primary Keys should be immutable ("never-changing").
  2. The key now depends on the data.
  3. If the data changes (e.g. "blcak lotus" is corrected to "Black Lotus"), then the key no longer matches the data and must be recomputed, but you can't do that because keys are immutable... rendering the previously-computed Id invalid.
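Point 3 is easy to demonstrate with the scheme from the question itself: correcting a single typo in the name yields a completely different key, orphaning every row that referenced the old one.

```php
<?php
// The key is derived from the data, so fixing a typo changes the key.
$before = crc32(implode(',', ['A', '1993-08-03', '232a', 'blcak lotus']));
$after  = crc32(implode(',', ['A', '1993-08-03', '232a', 'black lotus']));
var_dump($before === $after); // bool(false): the old Id no longer identifies the row
```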

QED :)

Dai
  • Great answer. Would add that good primary keys are important for proper indexing, and integers are much more efficient to create indexes on. – Tim Mar 25 '15 at 04:46
  • @Tim The efficiency gains are insignificant today - the cost of indexing a column of 4-byte integer values is roughly the same as that of 4-byte character values, for example - even an `nvarchar(255)` value can be looked up in an index very quickly. Even if you microbenchmarked it (itself an unrealistic scenario, given everything else that happens in most database applications) I doubt you would see any difference at all for strings under 20 characters or so. – Dai Jun 28 '17 at 23:53
  • That is so wrong at an enterprise level. Ordered, equally distributed keys make a night-and-day difference in DB performance with any significant number of rows and significant query traffic. If you have a few hundred rows and a handful of queries a second, any key will do. Scale that to millions of rows with thousands of queries a second, and your DB will melt with a poorly designed key. – Tim Jun 29 '17 at 00:28
  • @Tim That depends entirely on the implementation of the index and the DBMS itself (as there are many different indexing structures a DBMS could use). I'm not aware of SQL Server or MySQL being affected by key value distribution. Do you have a resource or paper that documents - or a benchmark that demonstrates - performance negation from uneven key value distribution? Note that in distributed systems you have a point, yes, but distributed systems generally cannot use monotonically-increasing keys anyway. – Dai Jun 29 '17 at 00:32
  • There's plenty of articles out there detailing how to pick a good key. Most SQL based DBMS's benefit greatly from a well chosen PK, for several reasons. 1. Clustered indexes determine disk layout. This means that a poorly chosen key can make the pages that are retrieved fragmented, and difficult to sort for joins. 2. Indexes that are not equally distributed cause expensive inserts when data must be moved around on pages to fit in values that go into the middle of existing key ranges. 3. Non distributed keys make for less efficient query plans, because the number of pages required to be read – Tim Jun 29 '17 at 00:45
0

Study this link: http://preshing.com/20110504/hash-collision-probabilities/

You will see that for any 32-bit hash and 25K-plus values, the odds of a collision are nearly 1 in 10 — no matter how good the hash algorithm.

You either need a mechanism for dealing with collisions or a switch to a 64-bit hash algorithm. CRC-64 would seem to be good enough: according to this comparison of CRC collisions at various bit sizes, 18.2 million samples produced zero collisions at 64 bits.
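The same birthday arithmetic shows how dramatic the jump from 32 to 64 bits is; a quick sketch using the question's 25,000 cards:

```php
<?php
// Collision odds for n = 25,000 keys: p ≈ 1 - e^(-n^2 / 2N)
$n = 25000;
printf("32-bit: %.4f\n", 1 - exp(-($n * $n) / (2 * 2 ** 32))); // 0.0702 (~1 in 14)
printf("64-bit: %.2e\n", 1 - exp(-($n * $n) / (2 * 2 ** 64))); // ~1.7e-11
```

Doubling the bit width squares the range space, which is why the 64-bit odds are negligible where the 32-bit odds are already a few percent.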

James Anderson
  • While your hash comments are spot on, using a hash of data values for a primary key is a *very* bad idea. – Tim Mar 25 '15 at 04:29
  • @Tim -- on the face of it you are correct. But I suspect that the OP could have a valid reason for a hash (not apparent from his question) and anyway I want to make sure that any casual reader of this thread does not go away with the impression that CRC32 is a good way to derive unique keys. – James Anderson Mar 25 '15 at 04:43