I would like to create a unique ID from a medical ID. It sounds like a common problem, but I haven't been able to find the topic on Stack Overflow or via Google. I'm new to Python, so a code example would be great!
I have several dataframes with up to 4 million rows covering 5,000-6,000 different patients, and I would like to be able to add more patients (up to 5 million unique patients) with the same code and the same chance of uniqueness. In total, the final merged dataset has up to 10 million rows.
It should be near impossible to reverse engineer the generated unique ID, even if you know the format of the medical ID.
The medical ID consists of the birthday (DDMMYY, as in the example further down) followed by four characters, each a digit (0-9) or a letter (A-Z).
I've read the following posts on the subject, and some questions remain unanswered:
Irreversible unique ID from String: this post describes the possibility of using rainbow tables to reverse engineer the unique ID, and suggests using a salt to defeat a rainbow table. Unfortunately, salt is something I've never worked with.
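To make the salt idea concrete, here is a minimal sketch using a keyed hash (HMAC-SHA256) from the standard library. A per-record random salt would not give the same output for the same medical ID, so the sketch assumes a single secret value shared by every run instead. `SECRET_KEY` and `keyed_digest` are hypothetical names for this illustration, and the key shown is only a placeholder.

```python
import hashlib
import hmac

# Hypothetical placeholder. In practice the key would be generated once
# (e.g. with secrets.token_bytes(32)) and stored separately from the data.
SECRET_KEY = b"replace-with-a-long-random-secret"


def keyed_digest(medical_id: str, key: bytes = SECRET_KEY) -> str:
    """Return the hex HMAC-SHA256 digest of a medical ID under the given key."""
    return hmac.new(key, medical_id.encode("utf-8"), hashlib.sha256).hexdigest()


same_key_a = keyed_digest("311222R09B")
same_key_b = keyed_digest("311222R09B")
other_key = keyed_digest("311222R09B", key=b"a-different-secret")

print(same_key_a == same_key_b)  # same ID and same key: identical digest
print(same_key_a == other_key)   # same ID, different key: different digest
```

A precomputed table is of no use to someone who does not hold the key, and the result does not depend on the machine, only on the key and the input.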
https://www.sohamkamani.com/uuid-versions-explained/ If I use UUID v1, it depends on the current computer's MAC address, which is not an option, as the same unique ID should be generated regardless of which computer it is generated on. I can't really get my head around whether a unique ID made with UUID v4 could be reverse engineered with a rainbow table, since for a person with the right knowledge it would be quite easy to figure out the medical ID system.
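As a small illustration of how the UUID versions differ, the sketch below compares `uuid.uuid4` (random, ignores the input) with `uuid.uuid5` (a SHA-1 hash of a namespace and a name, so it is reproducible on any machine). `NAMESPACE` is a made-up namespace for this example only.

```python
import uuid

medical_id = "311222R09B"

# uuid4 is random: every call gives a new value, unrelated to the medical ID.
random_a = uuid.uuid4()
random_b = uuid.uuid4()

# uuid5 hashes a namespace plus a name, so it is stable across calls and machines.
# NAMESPACE is a hypothetical namespace invented for this illustration.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")
stable_a = uuid.uuid5(NAMESPACE, medical_id)
stable_b = uuid.uuid5(NAMESPACE, medical_id)

print(random_a == random_b)  # False: v4 cannot map one patient to one ID
print(stable_a == stable_b)  # True: v5 is deterministic
```

Note that v5 is deterministic but not secret: anyone who knows the namespace and the medical ID format can recompute the UUIDs of all candidate IDs.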
Generate ID from string in Python: wouldn't a hash be easy to reverse engineer?
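The concern about plain hashes can be shown with a short sketch: if the birthday is known, only 36**4 = 1,679,616 suffixes are possible, and all of them can be hashed in a few seconds. Even without the birthday, a century of dates gives only about 6 × 10^10 candidates. `plain_hash` and `brute_force_suffix` are hypothetical helper names, and SHA-256 stands in for any unkeyed hash.

```python
import hashlib
import itertools
import string

ALPHABET = string.digits + string.ascii_uppercase  # 36 possible characters


def plain_hash(medical_id: str) -> str:
    """Unkeyed SHA-256 of the medical ID (the approach being attacked)."""
    return hashlib.sha256(medical_id.encode("utf-8")).hexdigest()


def brute_force_suffix(birthday: str, target_hash: str):
    """Try every 4-character suffix for a known birthday (36**4 candidates)."""
    for chars in itertools.product(ALPHABET, repeat=4):
        candidate = birthday + "".join(chars)
        if plain_hash(candidate) == target_hash:
            return candidate
    return None


leaked = plain_hash("311222R09B")
recovered = brute_force_suffix("311222", leaked)
print(recovered)  # the original medical ID is found by exhaustive search
```

This is why an unkeyed hash of such a small input space offers little protection on its own, with or without a rainbow table.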
How to generate 8 digit unique identifier to replace the existing one in python pandas
So my requirements are:
- A unique ID generated from a medical ID
- No possible way to reverse it with a rainbow table (very important, as it is sensitive information).
- Very little risk of collision in generating the unique ID
- Not dependent on the MAC address or other machine-specific properties of a computer
- The same unique ID is generated from the same medical ID, independent of which computer it is generated on.
- Ideally a unique ID of 10-20 digits, with no letters. But if it needs to be longer, with both letters (A-Z) and numbers (0-9), so be it :)
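The trade-off between ID length and collision risk in the list above can be estimated with the birthday bound. The sketch below assumes the generated IDs behave like uniformly random numbers of the given length (as a truncated hash would); `collision_probability` is a hypothetical helper name.

```python
import math


def collision_probability(n_ids: int, n_digits: int) -> float:
    """Approximate chance of at least one collision among n_ids random IDs."""
    space = 10 ** n_digits
    # Birthday bound: 1 - exp(-n(n-1) / (2 * space)).
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))


for digits in (10, 15, 20):
    print(digits, collision_probability(5_000_000, digits))
```

Under that assumption, 5 million patients would almost certainly collide with 10 random digits, have roughly a 1% chance of a collision with 15 digits, and about a 1 in 10 million chance with 20 digits, so the upper end of the 10-20 digit range is the safer choice.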
Does any solution fit the above requirements? Could you kindly provide a code example, if none of the linked posts already has what I need?
Example (format DDMMYYXXXX): fictitious IDs of persons born in 2022
   Medical ID  Bloodsample Date
0  0101221234  5.2
1  0101224321  6.2
2  311222R09B  7.6
3  0203221234  3.8
4  311222R09B  5.7
5  0405229082  9.5
6  1012225879  7.2
7  2801226787  5.2
8  2706221HF9  6.3
9  3112228768  4.6
Rows 0 and 3, and rows 2 and 4, are the same patients. Rows 4 and 7 are not the same patient.
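One possible way to turn the example IDs into digit-only pseudonyms is sketched below, under stated assumptions: a keyed hash (HMAC-SHA256) reduced to 18 decimal digits. `SECRET_KEY`, `pseudonymize`, and the choice of 18 digits are assumptions for illustration, not an established recipe. The result is kept as a string so that leading zeros survive.

```python
import hashlib
import hmac

# Hypothetical placeholder; a real key must be long, random, and kept secret.
SECRET_KEY = b"replace-with-a-long-random-secret"


def pseudonymize(medical_id: str, key: bytes = SECRET_KEY, n_digits: int = 18) -> str:
    """Map a medical ID to a fixed-length, digit-only pseudonym."""
    digest = hmac.new(key, medical_id.encode("utf-8"), hashlib.sha256).digest()
    # Interpret the 32-byte digest as an integer and keep the last n_digits digits.
    number = int.from_bytes(digest, "big") % 10 ** n_digits
    return str(number).zfill(n_digits)


medical_ids = [
    "0101221234", "0101224321", "311222R09B", "0203221234", "311222R09B",
    "0405229082", "1012225879", "2801226787", "2706221HF9", "3112228768",
]
pseudo_ids = [pseudonymize(m) for m in medical_ids]

for original, pseudo in zip(medical_ids, pseudo_ids):
    print(original, "->", pseudo)
```

Identical medical IDs (such as rows 2 and 4) receive identical pseudonyms on any machine that holds the same key. With a pandas DataFrame, the same function could be applied with something like `df["Medical ID"].map(pseudonymize)`, assuming the column is named as in the example.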