I am in the process of creating demo data from data that contains Patient History Information (PHI). There are a few columns where I just want to generate a random value that stays consistent throughout the data. For example, there is a field like SSN that I want to create a random 9 number digit for each unique SSN, but keeping this number the same where the claims are of the same individual. So 1 SSN might have 5 claims and each claim will have the same randomly created SSN.
sample
ssn date1 procedure
443234432 1/1/2019 needle poke
443234432 1/2/2019 needle poke
676343522 1/3/2019 total knee procedure
443234432 1/4/2019 total hip procedure
676343522 1/5/2019 needle poke
final
ssn date1 procedure
856345544 1/1/2019 needle poke
856345544 1/2/2019 needle poke
979583338 1/3/2019 total knee procedure
856345544 1/4/2019 total hip procedure
979583338 1/5/2019 needle poke
As you can see, the snn changed, but stays the same for all instances where the ssn was the same.
For numbers like this, I can convert to a numeric and multiply/divide/add/subtract to create a random number that maintains integrity, but how can I handle this for instances where there are non-numerics?
I attempted to use HASHBYTES but am getting lots of strange characters. Is there another method that could generate a random value and maintain consistency throughout the dataset?
RIGHT(convert(NVARCHAR(10), HASHBYTES('MD5', SSN)),10) as SSN
RESULTS:
댛량뇟㻣砖聋蠤
I've read a number of articles about this, but I couldn't find much about maintaining consistency across multiple claims. I appreciate any feedback.