6

Description of problem: I'm in the process of working with a highly sensitive data-set that contains the people's phone number information as one of the columns. I need to apply (encryption/hash function on them) to convert them as some encoded values and do my analysis. It can be an one-way hash - i.e, after processing with the encrypted data we wont be converting them back to original phone numbers. Essentially, am looking for an anonymizer that takes phone numbers and converts them to some random value on which I can do my processing. Suggest the best way to do about this process. Recommendations on the best algorithms to use are welcome.

Update: size of the dataset My dataset is really huge in the size of hundreds of GB.

Update: Sensitive By sensitive, I meant that phone number should not be a part of our analysis.So, basically I would need a one-way hashing function but without redundancy - Each phone number should map to unique value --Two phones numbers should not map to a same value.

Update: Implementation ?

Thanks for your answers.I am looking for elaborate implementation.I was going through python's hashlib library for hashing, Does it necessarily do the same set of steps that you suggested ? Here is the link

Can you give me some example code to achieve the process , preferably in Python ?

Gilles 'SO- stop being evil'
  • 104,111
  • 38
  • 209
  • 254
Learner
  • 1,685
  • 6
  • 30
  • 42
  • I think you should explain what do you mean by sensitive, if it's sensitive as password, for instance, the idea of using hash is wrong and you should use only encryption (you encrypt the hash of the data). – TheNewOne Apr 08 '13 at 20:48
  • The main idea in hash (one of the 3 principals for a good hash function) that 2 different values should get the same value with a (very) low probability, if you don't need to protect the data (because by just using hash it would be very easy to crack it) so you can use it. – TheNewOne Apr 08 '13 at 21:09
  • @TheNewOne most encryption is non-deterministic so it would not be useful in this case. – jbtule Apr 08 '13 at 21:10
  • 1
    @jbtule - I didn't understand what you mean, many block ciphers (using IV) are deterministic, and I didn't understand your point. – TheNewOne Apr 08 '13 at 21:15
  • 4
    @jbtule If encryption was non-deterministic, you couldn't decrypt it reliably. Often times there is a PRNG involved, but all partners of the communication use the pseudo-random seed deterministically. Learner: I would suggest encryption instead of hashing; with hashing, you are likely to have collisions. With encryption, that isn't possible; you can take a really simple cipher, for example DES, which probably will be less efficient than a simple hash, but you won't have collisions and won't skew your results. – G. Bach Apr 08 '13 at 21:16
  • You will have to encrypt all your data with the same algorithm and the same keys; also make sure you do not use any chaining mode (e.g. CBC) since those also diffuse statistical properties of the clear text. Just go with an efficiently implemented block cipher and use ECB. – G. Bach Apr 08 '13 at 21:17
  • @G.Bach +1 i agree. I think -Learner should also tell what DB he uses. but anyway he should encrypt the hash and not plain data. – TheNewOne Apr 08 '13 at 21:19
  • 3
    @G.Bach No, good encryption is nondeterministic -- the encrypted string is longer than than the message. – David Eisenstat Apr 08 '13 at 21:40
  • Wow - this sounds a lot like an interview question. The main question people seem to be wrestling with is if the low probability of collision given by hashes is acceptable. Is it? Another issue is how much, if any, data set growth is acceptable? Also, what is your adversarial model - are guessing/dictionary attacks a problem? Finally, people seem stuck on trying to argue a senseless point about determinism of ciphers (some are and some aren't deterministic - almost all block ciphers are, most asymmetric and homomorphic schemes aren't but there are "good" ciphers in all those categories). – Thomas M. DuBuisson Apr 09 '13 at 00:25
  • Oh, and how must the phone numbers be used later? Do you have to be able to compare them for equality or any other such computations? Presumably you must need something from them or else you could toss the whole column out. – Thomas M. DuBuisson Apr 09 '13 at 00:26
  • Thanks a lot people. @TheNewOne I have not decided with the DB yet. Probably, mysql for the simplicity of it. Sorry to sound naive, but why does the type of db come into picture ? – Learner Apr 09 '13 at 02:23
  • @ThomasM.DuBuisson : Yeah,sure I need to compare them for equality later. After, hashing/encrypting , I would like to use them for all other processing/comparisons/analysis treating the hashed value as original phone-number."Toss the whole column out" Could you elaborate more on this statement ? Thanks :) – Learner Apr 09 '13 at 02:35
  • 1
    The db because maybe someone may recommend for the functions to use (faster/more secure etc.) according to the db you use. – TheNewOne Apr 09 '13 at 08:52
  • 1
    @TheNewOne When I says deterministic I mean given the same plaintext, produces the same ciphertext, most encryption adds an additional random or unique per encryption vector so that no two identical plaintexts produce the same ciphertext and that the ciphertext looks like random data. – jbtule Apr 09 '13 at 13:11
  • I know what you mean, and i agree with your answer. I didn't understand why you said that it's not good for him, but because you provided an answer that seems agreed for everyone (because there's key+hash) everyone is happy :). – TheNewOne Apr 09 '13 at 17:04

4 Answers4

4

Generate a key for your data set (16 or 32 bytes) and keep it secret. Use Hmac-sha1 on your data with this key, and base 64 encode that and you have a random unique string per phonenumber that isn't reversable (without the key).

Example (Hmac-Sha1 with 256bit key) using Keyczar:

Create random secret key:

$> python keyczart.py create --location=path_to_key_set --purpose=sign
$> python keyczart.py addkey --location=path_to_key_set --status=primary

Anonymize phone number:

from keyczar import keyczar

def anonymize(phone_num):
  signer = keyczar.Signer.Read("path_to_key_set");
  return signer.Sign(phone_num)
jbtule
  • 31,383
  • 12
  • 95
  • 128
  • Hashes may produce collisions, while he would have to have 2^128 distinct clear text messages to have a high likelihood to have collisions, he'll have to decide whether he wants to trade the possible efficiency loss he'll take using an encryption algorithm against the risk of skewing his data. He can test the efficiency beforehand, and considering SHA does some cryptographic operations, it may be hardly faster than an actual encryption algorithm. – G. Bach Apr 08 '13 at 21:25
  • The issue here is the lack of provable security, not the collisions. – David Eisenstat Apr 08 '13 at 21:46
  • 1
    Also, @G.Bach good thing there aren't any where near 2^128 phone numbers in existence! – jbtule Apr 08 '13 at 22:41
  • @jbtule: Thanks for the answer.Since, I'm new to this field, I am looking for elaborate implementation.I was going through python's hashlib library for hashing, Does it necessarily do the same set of steps that you suggested ? Here is the link : http://docs.python.org/2/library/hashlib.html . Can you give me some example code to achieve the process , preferably in Python ? – Learner Apr 09 '13 at 20:44
  • 1
    @Learner This is where to find [hmac](http://docs.python.org/2/library/hmac.html#module-hmac), but i updated with a more fool proof example. – jbtule Apr 09 '13 at 21:16
1

If you're going to use cryptography, you want to apply a pseudorandom function to each phone number and throw away the key. Collision-resistant hashes such as SHA-256 do not provide the right security guarantees. Really, though, are there that many different phone numbers that you can't just construct incrementally a map representing an actually random function?

David Eisenstat
  • 64,237
  • 7
  • 60
  • 120
  • That's a really good point about the map, no matter how large a dataset, there won't be that many phone numbers. – jbtule Apr 08 '13 at 21:58
  • With "not the right security guarantees" David means that if you simply iterate over all possible phone numbers and you calculate the hash and compare it with a stored hash, you can simply find most if not all phone numbers that were in the database. – Maarten Bodewes Apr 08 '13 at 22:22
  • 1
    @owlstead That, and that unprincipled attempts to construct a PRF from a collision-resistant hash will not be provably secure. – David Eisenstat Apr 08 '13 at 23:07
  • @DavidEisenstat How is determining phone numbers from HMACs (without the key, even) any different to finding a preimage for them? – Nick Johnson Apr 09 '13 at 15:20
  • 1
    @NickJohnson CRHFs may leak subtler forms of information than preimages that, combined with a side channel, could compromise the data set. The security guarantee for a PRF ensures that, regardless of his prior knowledge, a computationally-bounded adversary almost certainly does not "learn anything" beyond what's intended from viewing the outputs of the PRF. – David Eisenstat Apr 09 '13 at 15:30
  • @DavidEisenstat Can you give an example? My understanding is that if a hash function like SHA, or the HMAC construction itself was flawed in the way you describe, it would be fatally unsuited to its main purpose. – Nick Johnson Apr 09 '13 at 15:33
  • @NickJohnson H(x) is a function defined by concatenating SHA256(x) with 40 bits of x where the phone numbers are (appending NULs if len(x) < 10). I claim (i) H is collision-resistant (with a slightly weaker security parameter) or SHA256 is not collision-resistant and (ii) the proposed scheme of hashing the concatenation of a secret key and the phone number is completely insecure when H is the hash function. – David Eisenstat Apr 09 '13 at 15:42
  • "where the phone numbers are" means the lower 4-bits of the last 10 characters. – David Eisenstat Apr 09 '13 at 15:48
  • 1
    @DavidEisenstat I'm not sure what previous edits said, but jbtule et al are now proposing HMAC-SHA256, not concatenation. Concatenation has well known weaknesses as an HMAC construction. – Nick Johnson Apr 09 '13 at 16:47
  • 1
    Also, there was at least one deleted answer proposing something ad hoc as well. – David Eisenstat Apr 09 '13 at 17:23
1

sort your data by the respective column and start counting distinct values ... replace the actual values with their respective counter value ... collision free ... one way ...

DarkSquirrel42
  • 10,167
  • 3
  • 20
  • 31
  • Bad idea. Now if I know a few numbers from the set from an external data source, I can very much narrow down the possible values for the items I don't know. – Nick Johnson Apr 09 '13 at 15:20
  • or insert random (not repeating) numbers instead of the counter. I guess it is not so simple. – jbtule Apr 09 '13 at 16:58
  • just inserting random numbers won't make sure that the mapping is collision free ... grouping and replacing grouped values while the groups are in random order should prevent the attack @NickJohnson mentioned – DarkSquirrel42 Apr 09 '13 at 17:05
-1

"So, basically I would need a one-way hashing function but without redundancy - Each phone number should map to unique value --Two phones numbers should not map to a same value."

This screams for a solution based on a cryptographic hash function. MD5 and SHA-1 are the best known examples, and work wonderfully for this. You will read that "MD5 has been cracked", but for your purpose that doesn't matter.

Ross Patterson
  • 9,527
  • 33
  • 48
  • 1
    MD5 having "been cracked" as reported in the literature and press does indeed not matter. Nobody has reported any progress on reversing an MD5 hash back to its original input, only on producing collisions in extremely specific situations. The latter doesn't bear on the OP's question. – Ross Patterson Apr 09 '13 at 11:31
  • 3
    You can't use a hash function that's output is determined from only the phone number as the input, because brute forcing every phone number in existence is trivial, you don't even need a pregenerated rainbow table to give you all the possible reversals as there are only a fixed number of area codes and then 7 digits. Once you have the reversals the data is no longer anonymized which is relevant to the OP's question. – jbtule Apr 09 '13 at 13:14