I'm creating a data warehouse of customer data. This involves reading CSVs with various information:
cell email gender language transaction
5551212 foo@bar.com M E 005
I uniquely identify customers by cell+email (I know this is not great, but that's a separate question), and I want to de-identify this data by removing cell and email while retaining the ability to match the remaining non-identifying information in future records to a particular customer.
One approach I had considered is hashing cell+email with a secure hash algorithm such as SHA-2. The stored data would then be:
uid gender language transaction
aW51SGvswX... M E 005
When I receive additional records, I hash cell+email. If the hash is new, I create a new customer. If the hash already exists, I increment the transaction counter on that customer.
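As a rough sketch of the workflow just described, assuming SHA-256 as the SHA-2 variant and an in-memory dict standing in for the customer table: the names `customer_uid` and `ingest`, the `|` separator, and the hex encoding of the uid are illustrative choices of mine, not part of any existing system.

```python
import hashlib

def customer_uid(cell: str, email: str) -> str:
    # Assumption: join the two fields with a separator so that different
    # (cell, email) pairs cannot concatenate to the same string.
    key = f"{cell}|{email}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def ingest(store: dict, row: dict) -> str:
    """Insert a new customer or bump the transaction counter of an existing one."""
    uid = customer_uid(row["cell"], row["email"])
    customer = store.get(uid)
    if customer is None:
        # Only non-identifying columns are kept; cell and email are dropped.
        store[uid] = {
            "gender": row["gender"],
            "language": row["language"],
            "transactions": 1,
        }
    else:
        customer["transactions"] += 1
    return uid

store = {}
row = {"cell": "5551212", "email": "foo@bar.com", "gender": "M", "language": "E"}
ingest(store, row)  # first sighting: creates the customer
ingest(store, row)  # same cell+email: increments the counter
```

After the two calls, `store` holds a single entry keyed by the hash, with no cell or email stored alongside it.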
If an attacker steals the DB, they would need to hash various combinations of email+cell to recover the transaction history.
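To make that threat concrete, here is a minimal sketch of the guessing attack, assuming the attacker knows the hashing scheme and holds lists of candidate cell numbers and email addresses. The function names and the `|` separator are hypothetical and only mirror the scheme described above.

```python
import hashlib
from itertools import product

def customer_uid(cell: str, email: str) -> str:
    # Same illustrative scheme as the warehouse: SHA-256 over "cell|email".
    return hashlib.sha256(f"{cell}|{email}".encode("utf-8")).hexdigest()

def recover(stolen_uids: set, candidate_cells: list, candidate_emails: list) -> dict:
    """Hash every candidate pair and report the ones present in the stolen DB."""
    found = {}
    for cell, email in product(candidate_cells, candidate_emails):
        uid = customer_uid(cell, email)
        if uid in stolen_uids:
            found[uid] = (cell, email)
    return found
```

The cost for the attacker is one hash per candidate pair, so the protection depends on how large the space of plausible cell+email combinations is and how expensive each hash is to compute.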
I've read "How to separate a person's identity from his personal data?", and of course I realize that if an attacker has prolonged access to the system, all of the records observed during that time are compromised. The specific scenario I'm looking to avoid, however, is one-time theft of the DB.
I assume key stretching the hash is a good idea. I don't believe it's possible to use a salt to protect against rainbow tables, since I would need to calculate and store the salt beforehand and look it back up, which I can't do without the hash.
Are there any alternatives to this system? Anything I'm overlooking?
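For illustration, key stretching could look something like the sketch below, using PBKDF2 from Python's standard library. PBKDF2 requires a salt argument, so this assumes a single fixed, application-wide value rather than a per-record salt, for exactly the lookup reason given above. `FIXED_SALT`, `ITERATIONS`, and `stretched_uid` are hypothetical names, and the iteration count is a placeholder rather than a recommendation.

```python
import hashlib

# Assumption: one constant shared by every record, so the uid stays
# reproducible without having to look anything up first.
FIXED_SALT = b"example-warehouse-v1"
# Assumption: illustrative work factor; tune to the hardware in use.
ITERATIONS = 100_000

def stretched_uid(cell: str, email: str) -> str:
    key = f"{cell}|{email}".encode("utf-8")
    # PBKDF2-HMAC-SHA256 repeats the underlying hash ITERATIONS times,
    # which makes each guess proportionally more expensive.
    digest = hashlib.pbkdf2_hmac("sha256", key, FIXED_SALT, ITERATIONS)
    return digest.hex()
```

The same cell+email always yields the same uid, so matching future records still works, while each attacker guess costs roughly `ITERATIONS` hash computations instead of one.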
Thanks,
Justin