0

To reduce privacy risk, I should scrub the SSNs from the input. I need to scrub them in a way that preserves uniqueness. E.g. if I have 111-11-1119, 111-11-1119, and 111-11-1118, we could number 111-11-1119 as 1 and 111-11-1118 as 2.

What's the best way to do that?

Brad Schoening
  • 1,281
  • 6
  • 22
  • What format is your original data in? – gmds Apr 23 '19 at 13:21
  • If you are reading a single file you can store the originals and the substitutions in a dictionary. But if you want the identity of the substitutions to persist over multiple files (what you called 2 yesterday you will call 2 again today) then you will have to store the mappings in a database or something, and that just moves the problem. So what you want to do is much easier said than done. – BoarGules Apr 23 '19 at 13:28
  • please show us your data and also your expected result. for example; is this your data = ["111-11-1119", "111-11-1119", and "111-11-1118"] and expected result = ["1", "2"] ??. – jose_bacoy Apr 23 '19 at 13:28
  • 1
    `re.sub` supports callable as a replacement. Write a function that uses a global dictionary of sanitized SSNs, storing replacement values – Marat Apr 23 '19 at 13:31
  • any crypographically secure one way hash. SHA-256 maybe – Kenny Ostrom Apr 23 '19 at 13:34
  • Depends on your use case. For "have I seen this one before" a hash would work well; but if you need to be able to see part or all of the value later, a different approach is required. If e.g. 111-11 is not sensitive information, but the last group is, you could replace the last group with a hash (whilst of course taking care to avoid making it easy to reverse engineer - maybe hash the entire input so that two SSNs with the same final group don't get the same hash). – tripleee Apr 23 '19 at 13:36
  • Original data is in Json – Asif Sabery Apr 24 '19 at 15:27
  • Marat can you gimme an example? – Asif Sabery Apr 24 '19 at 18:30

2 Answers2

1

To remove SSN or other standardized PII while preserving uniqueness, you will need a cryptographic hash function. This is not something that you should try to implement yourself with an incrementing ID and dict. To take PII seriously, you will need to do a bit of research to understand what a crytographic hash is doing and how it can protect the data.

For a previous discussion, see Cryptographic hash functions in Python

Some of these might be helpful as introductions:

  • How is hash any better than simple replacement? Both make it impossible to guess the original. – ivan_pozdeev Apr 23 '19 at 13:40
  • 1
    There are only about a billion SSNs, so if the code and "deidentified" data are available, it would be trivial to create a rainbow table, crosswalking all billion directly back to their originals. A properly keyed cryptographic hash prevents this. – Matt VanEseltine Apr 23 '19 at 13:49
  • A PRF is not sufficient as it has values where H(a) == H(b) where a != b. A PRP has the property that for a limited domain of values P(a) == P(b) if and only if a == b. – Dan D. Apr 23 '19 at 19:41
  • Original data is in JSON, hash is too complicated, i dont need to show any part of the ssn at all. is ti possible to do it with List? – Asif Sabery Apr 24 '19 at 15:26
1

Tokenization or Format-Preserving Encryption (FPE) are appropriate anonymization technique for primary keys in PII data like SSN. Both can provide consistency and uniqueness.

You can use a NIST approved FPE algorithms where a python library is now available for FF3. Alternatively, you could create a tokenizer using python regular expression generator library src-yield or hypothesis enter image description here

Brad Schoening
  • 1,281
  • 6
  • 22