Securely de-identifying while retaining unique IDs

Question

I'm creating a data warehouse of customer data. This involves reading CSVs with various information:

cell      email         gender  language  transaction
5551212   foo@bar.com     M        E         005

I uniquely identify customers by cell+email (I know this is not great, but that's a separate question). I'd like to de-identify this data by removing cell and email, while retaining the ability to match the non-identifiable information in future records to particular customers.

One approach I had considered was hashing cell+email using a secure hash algorithm like SHA-2. The data stored would then be:

uid         gender  language  transaction
aW51SGvswX...     M        E         005

When I receive additional records, I hash cell+email. If the hash is new, I create a new customer. If the hash already exists, I increment the transaction counter on that customer.
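For illustration, the scheme described above could be sketched roughly as follows. This is only a minimal sketch: the function names, the in-memory `customers` dict standing in for the DB table, the `|` separator, and the normalisation of the inputs are all assumptions, not part of the actual system.

```python
import hashlib


def make_uid(cell: str, email: str) -> str:
    """Derive a stable identifier from cell+email with SHA-256 (a SHA-2 variant)."""
    # Assumption: trim whitespace and lowercase the email so that trivially
    # different spellings of the same identity map to the same uid.
    # The "|" separator keeps the two fields from running into each other.
    normalized = f"{cell.strip()}|{email.strip().lower()}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def record_transaction(customers: dict, cell: str, email: str,
                       gender: str, language: str) -> str:
    """Create the customer if the uid is new, then bump the transaction counter."""
    uid = make_uid(cell, email)
    if uid not in customers:
        # Only non-identifying attributes are stored next to the uid.
        customers[uid] = {"gender": gender, "language": language, "transactions": 0}
    customers[uid]["transactions"] += 1
    return uid
```

Calling `record_transaction` twice with the same cell and email returns the same uid and leaves a single customer row with a counter of 2.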

If an attacker steals the DB, they would need to hash various combinations of email+cell to recover the transaction history.

I've read "How to separate a person's identity from his personal data?", and of course I realize that if an attacker has prolonged access to the system, all of the records observed during that time are compromised. The specific scenario I'm looking to avoid, however, is a one-time theft of the DB.

I assume key stretching the hash is a good idea. I don't believe it's possible to use a salt to protect against rainbow tables, since I would need to calculate and store the salt beforehand and look it back up, which I can't do without the hash.
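As a rough sketch of what key stretching could look like here, the snippet below uses PBKDF2 from Python's standard library. It assumes a single application-wide salt rather than a per-record one, which sidesteps the lookup problem because the same salt is used for every record; that choice, the salt value, the iteration count, and the function name are all assumptions made for illustration. A fixed salt only makes generic precomputed tables useless; an attacker who obtains it can still brute-force guesses, just more slowly.

```python
import hashlib

# Hypothetical values: the salt would be random bytes kept outside the DB,
# and the iteration count would be tuned to the hardware in use.
SITE_SALT = b"replace-with-random-bytes-kept-outside-the-db"
ITERATIONS = 200_000


def stretched_uid(cell: str, email: str,
                  salt: bytes = SITE_SALT, iterations: int = ITERATIONS) -> str:
    """Derive the uid with PBKDF2-HMAC-SHA256 so each guess costs many hash calls."""
    identity = f"{cell.strip()}|{email.strip().lower()}".encode("utf-8")
    return hashlib.pbkdf2_hmac("sha256", identity, salt, iterations).hex()
```

Because the salt and iteration count are constant, the same cell+email always yields the same uid, so new records can still be matched to existing customers.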

Are there any alternatives to this system? Anything I'm overlooking?

Thanks,

Justin

score 3 · Answer 1 · answered Apr 03 '11 at 18:16

This is a place where a surrogate key (e.g. an identity or autonumbering column) would help. Rather than hashing existing data to tie it back to the original rows, I'd either add a surrogate key to the source table or create a mapping table from surrogate key to cell+email, and use the surrogate key in the data warehouse. If the data warehouse gets compromised, the attacker only has an arbitrary reference value. One big reason for using a surrogate key is that queries against the data warehouse should perform noticeably better than they would with a hashed value.

Btw, this solution assumes that only the data warehouse is compromised, and that the source databases are protected from attack via encryption or other such means.
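A minimal sketch of the mapping-table idea, using SQLite only because it ships with Python. The table name `customer_map`, the column names, and the helper functions are hypothetical; the mapping table would live with the protected source data, and only `customer_id` would be copied into the warehouse.

```python
import sqlite3


def open_mapping(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the mapping table of surrogate key -> cell+email."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_map ("
        " customer_id INTEGER PRIMARY KEY,"  # auto-assigned surrogate key
        " cell TEXT NOT NULL,"
        " email TEXT NOT NULL,"
        " UNIQUE (cell, email))"
    )
    return conn


def surrogate_key(conn: sqlite3.Connection, cell: str, email: str) -> int:
    """Return the existing surrogate key for cell+email, creating one if needed."""
    conn.execute(
        "INSERT OR IGNORE INTO customer_map (cell, email) VALUES (?, ?)",
        (cell, email),
    )
    conn.commit()
    row = conn.execute(
        "SELECT customer_id FROM customer_map WHERE cell = ? AND email = ?",
        (cell, email),
    ).fetchone()
    return row[0]
```

The warehouse rows then carry an integer that reveals nothing about the customer unless the mapping table is also taken.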

Good suggestion for performance down the road. The "data warehouse" I was talking about in actuality will live on the same DB server for a while, so compromising one will probably compromise the other unfortunately. — Allyl Isocyanate, Apr 04 '11 at 15:06

score 1 · Answer 2 · answered Apr 03 '11 at 18:04

You could encrypt the cell+email with a symmetric key before you hash it. This would make successful lookups in rainbow tables less likely. However, if the DB is compromised, it is likely that the attacker also has access to the symmetric key.
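One way to sketch this idea with only the Python standard library is a keyed hash (HMAC). Note that this is a swapped-in, closely related construction rather than literal encrypt-then-hash: the secret key is mixed into the hash directly. The function name and the way the key is supplied are assumptions for illustration, and the caveat above still applies if the key is stored alongside the DB.

```python
import hashlib
import hmac


def keyed_uid(secret_key: bytes, cell: str, email: str) -> str:
    """Derive the uid with HMAC-SHA256; without the key, guesses cannot be checked."""
    identity = f"{cell.strip()}|{email.strip().lower()}".encode("utf-8")
    return hmac.new(secret_key, identity, hashlib.sha256).hexdigest()
```

The same key must be used for every record so that repeat customers map to the same uid; a different key produces an unrelated set of uids.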

monken

Interesting suggestion. Along those lines, I guess I could figure out something with a public/private key pair, retaining the private key locally. I could still encrypt and look up the values, and retrieve the secret value as necessary. – Allyl Isocyanate, Apr 04 '11 at 15:07

If it is about rainbow table protection, use a salt instead. – Jacco, Apr 07 '11 at 10:56
