
I would like to create a unique ID from a medical ID. It sounds like a common problem, but I haven't been able to find the topic on Stack Overflow or via Google. I'm new to Python, so a code example would be great!

I have several dataframes with up to 4 million rows covering 5,000-6,000 different patients, and I would like to be able to add more patients (a maximum of 5 million unique patients) with the same code and the same chance of uniqueness. In total, the final merged dataset has up to 10 million rows.

It should be nearly impossible to reverse engineer the generated unique ID, even if you know the format of the medical ID.

The medical ID consists of the birthday (DDMMYY) followed by four characters, each of which is a digit (0-9) or a letter (A-Z).

I've read the following posts on the subject, and some questions remain unanswered:

Irreversible unique ID from String: This post describes the possibility of using rainbow tables to reverse engineer the unique ID, and suggests using a salt to defeat them. Unfortunately, salt is something I've never worked with.

https://www.sohamkamani.com/uuid-versions-explained/ If I use UUID v1, the result depends on the current computer's MAC address, which is not an option, as the unique ID should be the same regardless of the computer it is generated on. I can't really get my head around whether a unique ID made with UUID v4 could be reverse engineered using a rainbow table, since for a person with the right knowledge it would be quite easy to figure out the medical ID system.

Generate ID from string in Python: Using a hash, wouldn't that be easily reverse engineered?

How to generate 8 digit unique identifier to replace the existing one in python pandas

So my requirements are:

  1. A unique ID generated from a medical ID
  2. No possible way to reverse it with a rainbow table (very important, as it is sensitive information).
  3. Very little risk of collision in generating the unique ID
  4. Not dependent on the MAC address or anything else unique to a computer
  5. The same unique ID is generated from the same medical ID, independent of which computer it is generated on.
  6. Ideally a unique ID of 10-20 digits, with no letters. But if it needs to be longer, with both letters (A-Z) and numbers (0-9), so be it :)

Does any solution fit the requirements above? Could you be so kind as to provide a code example, if none of the links above already has what I need?

Example: (DDMMYYXXXX) Fictitious IDs for persons born in 2022

   Medical ID  Bloodsample Date
0  0101221234  5.2         
1  0101224321  6.2         
2  311222R09B  7.6         
3  0203221234  3.8         
4  311222R09B  5.7         
5  0405229082  9.5        
6  1012225879  7.2         
7  2801226787  5.2         
8  2706221HF9  6.3         
9  3112228768  4.6         

0 and 3, and 2 and 4 are the same patients. 4 and 7 are not the same patient.

Dane Dane
  • I calculate about 62.5 billion possible IDs, given the scheme you describe - greatly reduced if you only consider a limited range of birth years. "Irreversible" is not a reasonable goal, given your stated requirements; whatever algorithm is chosen, an attacker only has to execute it 62.5 billion times to reverse *every* encoded ID. – jasonharper Dec 15 '21 at 19:20
  • To make a hash more secure, look at [salting](https://en.wikipedia.org/wiki/Salt_(cryptography)) and [stretching](https://en.wikipedia.org/wiki/Key_stretching). These are often used for hashing passwords, but the techniques can be applied to any piece of data that requires higher levels of security than a simple hash. – rossum Dec 16 '21 at 12:33
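
The figure in the first comment can be reproduced with a short calculation. The sketch below assumes the DDMMYYXXXX scheme from the question, with generous bounds of 31 days, 12 months and 100 years; the two guess rates are hypothetical values picked only to show why stretching (making each guess slower) matters.

```python
# Rough size of the ID space for the DDMMYYXXXX scheme from the question.
DAYS, MONTHS, YEARS = 31, 12, 100   # generous upper bounds on DD, MM, YY
SUFFIX_ALPHABET = 36                # 0-9 plus A-Z
SUFFIX_LENGTH = 4

date_combinations = DAYS * MONTHS * YEARS
suffix_combinations = SUFFIX_ALPHABET ** SUFFIX_LENGTH
keyspace = date_combinations * suffix_combinations


def hours_to_enumerate(guesses_per_second: float) -> float:
    """Hours needed to try every possible ID at the given guess rate."""
    return keyspace / guesses_per_second / 3600


print(f"{keyspace:,} possible IDs")
# Hypothetical attacker speeds, chosen only to illustrate the effect of stretching.
print(f"fast hash:      {hours_to_enumerate(1e9):.2f} hours")
print(f"stretched hash: {hours_to_enumerate(1e4):,.0f} hours")
```

The total is 62,481,715,200, which matches the "about 62.5 billion" estimate. A slower function raises the cost of each guess but does not enlarge the space itself.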

1 Answer


DDMMYYXXXX

This format reveals the birth date, which may be a significant hint for narrowing the data down to a small group of people, so it is not really suitable for anonymization.

A unique ID generated from a medical ID...

What you are looking for may be a hash function. Cryptographic hash functions, such as SHA-256, are collision resistant. This means the probability of generating the same hash value for different inputs should be negligible (although mathematically never zero).
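
As a minimal sketch of what a plain cryptographic hash looks like in Python, the snippet below uses the standard-library `hashlib` module. The function name `sha256_id` is made up for illustration, and the IDs are the fictitious ones from the question. It only shows determinism and distinct outputs; an unkeyed hash on its own does not address the rainbow-table concern.

```python
import hashlib


def sha256_id(medical_id: str) -> str:
    """Return the SHA-256 digest of a medical ID as 64 hex characters."""
    return hashlib.sha256(medical_id.encode("utf-8")).hexdigest()


same_a = sha256_id("311222R09B")
same_b = sha256_id("311222R09B")
other = sha256_id("3112228768")

print(same_a == same_b)  # same input gives the same digest on any machine
print(same_a == other)   # different inputs give different digests
```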

No possible way to reverse it with a rainbowtable (very important, as it is sensitive information).

A cryptographic hash makes it computationally infeasible to reverse the value directly.

Rainbow tables are effective when the set of input values is known. For the input format DDMMYYXXXX, it should be possible to generate all possible values in a reasonable time and create a reverse-lookup table.
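
To make the risk concrete, here is a small sketch of the enumeration described above, written as a direct brute-force search rather than a stored table, and limited to a single known birth date so that it runs quickly. The function name `recover_id` and the "leaked" digest are illustrative assumptions; the ID is a fictitious one from the question.

```python
import hashlib
from itertools import product
from typing import Optional

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # digits plus A-Z


def recover_id(target_digest: str, birth_date: str) -> Optional[str]:
    """Try every 4-character suffix for one birth date until the digest matches."""
    for suffix in product(ALPHABET, repeat=4):
        candidate = birth_date + "".join(suffix)
        digest = hashlib.sha256(candidate.encode("utf-8")).hexdigest()
        if digest == target_digest:
            return candidate
    return None


# Pretend this unkeyed SHA-256 digest was found in a shared dataset.
leaked = hashlib.sha256(b"0101221234").hexdigest()
print(recover_id(leaked, "010122"))
```

With at most 36^4 (about 1.7 million) candidates per birth date, an unkeyed hash of such a short, structured ID offers little protection.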

In that case, you may try HMAC, which is a hash function combined with a secret key.

Unfortunately, Python is not my native language, so you will have to consult your favourite search engine to find an implementation.
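
For readers looking for a starting point, the following is a minimal sketch of the HMAC idea using Python's standard-library `hmac` and `hashlib` modules. Everything beyond those modules is an assumption made for illustration: the function name `pseudonymize`, the placeholder key, the upper-casing of the input, and the choice to reduce the digest to 18 decimal digits to fit the 10-20 digit wish from the question.

```python
import hashlib
import hmac

# Hypothetical placeholder: in practice, load a long random key from a secure
# store (for example, one generated once with secrets.token_bytes(32)).
SECRET_KEY = b"replace-with-a-long-random-secret"


def pseudonymize(medical_id: str, key: bytes = SECRET_KEY, digits: int = 18) -> str:
    """Map a medical ID to a fixed-length, digits-only ID using HMAC-SHA256."""
    # Assumption: IDs are case-insensitive and may carry stray whitespace.
    normalized = medical_id.strip().upper()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).digest()
    # Reduce the 256-bit digest to the requested number of decimal digits.
    number = int.from_bytes(digest, "big") % 10 ** digits
    return str(number).zfill(digits)


for mid in ["0101221234", "311222R09B", "311222R09B", "3112228768"]:
    print(mid, "->", pseudonymize(mid))
```

The same key yields the same ID on any computer, so the key has to be shared securely between machines and kept out of the dataset. Without it, an attacker cannot precompute a lookup table. Truncating to 18 digits trades some collision resistance for length: by the birthday approximation n²/(2N), 5 million patients in a space of 10^18 values give a collision chance of roughly 1 in 80,000. In pandas, the function could be applied with something like `df["Medical ID"].map(pseudonymize)`, where the column name is taken from the example table.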

gusto2
  • Thanks for the answer. I tried to take a look at SHA-256 combined with HMAC, and I think this could be the solution to my problem, but I didn't find any useful code examples. So if anyone else has a Python code example, they are more than welcome to post it :) – Dane Dane Dec 22 '21 at 10:24