how to scrub ssn while preserving their uniqueness?

Question

To reduce privacy risk, I should scrub the SSNs from the input. I need to scrub them in a way that preserves uniqueness. E.g. if I have 111-11-1119, 111-11-1119, and 111-11-1118, we could number 111-11-1119 as 1 and 111-11-1118 as 2.

What's the best way to do that?

If you are reading a single file you can store the originals and the substitutions in a dictionary. But if you want the identity of the substitutions to persist over multiple files (what you called 2 yesterday you will call 2 again today) then you will have to store the mappings in a database or something, and that just moves the problem. So what you want to do is much easier said than done. — BoarGules, Apr 23 '19 at 13:28
please show us your data and also your expected result. for example; is this your data = ["111-11-1119", "111-11-1119", and "111-11-1118"] and expected result = ["1", "2"] ??. — jose_bacoy, Apr 23 '19 at 13:28
`re.sub` supports callable as a replacement. Write a function that uses a global dictionary of sanitized SSNs, storing replacement values — Marat, Apr 23 '19 at 13:31
Depends on your use case. For "have I seen this one before" a hash would work well; but if you need to be able to see part or all of the value later, a different approach is required. If e.g. 111-11 is not sensitive information, but the last group is, you could replace the last group with a hash (whilst of course taking care to avoid making it easy to reverse engineer - maybe hash the entire input so that two SSNs with the same final group don't get the same hash). — tripleee, Apr 23 '19 at 13:36

score 1 · Answer 1 · answered Apr 23 '19 at 13:38

1

To remove SSN or other standardized PII while preserving uniqueness, you will need a cryptographic hash function. This is not something that you should try to implement yourself with an incrementing ID and dict. To take PII seriously, you will need to do a bit of research to understand what a crytographic hash is doing and how it can protect the data.

For a previous discussion, see Cryptographic hash functions in Python

Some of these might be helpful as introductions:

answered Apr 23 '19 at 13:38

Matt VanEseltine

149
2
7

How is hash any better than simple replacement? Both make it impossible to guess the original. – ivan_pozdeev Apr 23 '19 at 13:40
1

There are only about a billion SSNs, so if the code and "deidentified" data are available, it would be trivial to create a rainbow table, crosswalking all billion directly back to their originals. A properly keyed cryptographic hash prevents this. – Matt VanEseltine Apr 23 '19 at 13:49
A PRF is not sufficient as it has values where H(a) == H(b) where a != b. A PRP has the property that for a limited domain of values P(a) == P(b) if and only if a == b. – Dan D. Apr 23 '19 at 19:41
Original data is in JSON, hash is too complicated, i dont need to show any part of the ssn at all. is ti possible to do it with List? – Asif Sabery Apr 24 '19 at 15:26

Brad Schoening · Answer 2 · 2021-02-15T16:17:49.793

1

Tokenization or Format-Preserving Encryption (FPE) are appropriate anonymization technique for primary keys in PII data like SSN. Both can provide consistency and uniqueness.

You can use a NIST approved FPE algorithms where a python library is now available for FF3. Alternatively, you could create a tokenizer using python regular expression generator library src-yield or hypothesis

edited Feb 15 '21 at 16:17

answered Aug 22 '20 at 18:51

Brad Schoening

1,281
6
22

how to scrub ssn while preserving their uniqueness?

2 Answers2