
I have medical data for approximately 10,000 patients. I want to replace their IDs/Social Security Numbers (Patient_SSN) with a unique ID for each patient. Please note that some rows share the same Patient_SSN because the data is stored at the visit level: each visit is stored in a new row (i.e. with a different date), as with the rows for 'Mary' and 'John'.

Patient_Name = c("Alex", "Mary", "Sarah", "John", "Susan", "Jessica", "Sarah", "Karen", "Mary", "John")
Patient_SSN  =  c(1234,    43251,    9320,    2901,  3229,     4291,     9320,    9218988,    43251 ,  2901)
Visit_Date   =  c('10_21', '10_21',  '10_25', '10_25','10_26','10_27','10_28','10_28','10_28' ,'10_29')
BMI = runif(10, min=12, max =25);

data_hospital = data.frame(Patient_Name, Patient_SSN, BMI, Visit_Date)

My question is: how can I replace each SSN with a new ID for participant privacy, keeping in mind that some rows have the same SSN? Each new ID should have the same number of characters as the original Patient_SSN. Thank you in advance for your assistance.

Identicon
  • That post only produces single sequential numbers, not random ones. I'm looking for IDs with the same length as Patient_SSN, if that's possible. Thanks. – Identicon Oct 13 '21 at 22:21
  • in that case... https://stackoverflow.com/questions/63163337/create-unique-random-group-id-in-r – Skaqqs Oct 13 '21 at 22:26
  • data_hospital %>% mutate(NEW_ID = sample(5000)[group_indices(.,Patient_SSN)]) – Bloxx Oct 13 '21 at 22:29
  • I updated my answer so that Patient_SSN to newid is 1:1, and the lengths match. – Skaqqs Oct 14 '21 at 12:52

2 Answers


dplyr has a function for that! Check out ?group_data:

library(dplyr)
data_hospital$newid <- data_hospital %>% group_indices(Patient_SSN)

   Patient_Name Patient_SSN      BMI Visit_Date newid
1          Alex        1234 21.70192      10_21     1
2          Mary       43251 18.75820      10_21     6
3         Sarah        9320 22.84921      10_25     5
4          John        2901 19.94831      10_25     2
5         Susan        3229 20.27007      10_26     3
6       Jessica        4291 14.39934      10_27     4
7         Sarah        9320 16.65728      10_28     5
8         Karen     9218988 17.99142      10_28     7
9          Mary       43251 20.71236      10_28     6
10         John        2901 12.67764      10_29     2
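In dplyr 1.0.0 and later, calling group_indices() with grouping arguments is deprecated in favour of cur_group_id(). The following is a minimal sketch of the same idea using that function; the small data frame is only a stand-in for data_hospital from the question.

```r
library(dplyr)

# Small stand-in for data_hospital (same column names as in the question)
data_hospital <- data.frame(
  Patient_Name = c("Alex", "Mary", "Sarah", "John", "Mary"),
  Patient_SSN  = c(1234, 43251, 9320, 2901, 43251)
)

# cur_group_id() returns the index of the current group, so every row
# that belongs to the same patient receives the same integer
data_hospital <- data_hospital %>%
  group_by(Patient_SSN) %>%
  mutate(newid = cur_group_id()) %>%
  ungroup()
```

As with group_indices(), the resulting IDs are sequential rather than random, so they follow the sort order of the original SSNs.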

Edit: Based on good ideas from @Bloxx and Tjn25

data_hospital %>%
  group_by(Patient_SSN) %>%
  mutate(id = paste(sample(0:9, nchar(Patient_SSN), replace=TRUE), collapse=""))

# A tibble: 10 x 5
# Groups:   Patient_SSN [7]
   Patient_Name Patient_SSN   BMI Visit_Date id     
   <chr>              <dbl> <dbl> <chr>      <chr>  
 1 Alex                1234  12.1 10_21      7076   
 2 Mary               43251  17.3 10_21      04734  
 3 Sarah               9320  14.6 10_25      0161   
 4 John                2901  15.5 10_25      9063   
 5 Susan               3229  23.3 10_26      5817   
 6 Jessica             4291  17.1 10_27      1791   
 7 Sarah               9320  23.3 10_28      0161   
 8 Karen            9218988  23.7 10_28      8627443
 9 Mary               43251  23.1 10_28      04734  
10 John                2901  20.0 10_29      9063 
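The random digit strings above are drawn independently for each patient, so two patients could in principle receive the same ID. One possible way to rule that out is to build a lookup table and redraw on collision. This is a sketch in base R, not part of the original answer; random_digits and lookup are hypothetical names introduced here.

```r
set.seed(42)  # only so that the sketch is reproducible

Patient_SSN <- c(1234, 43251, 9320, 2901, 3229, 4291, 9320, 9218988, 43251, 2901)

# Hypothetical helper: one random digit string of length n
random_digits <- function(n) paste(sample(0:9, n, replace = TRUE), collapse = "")

ssn_chr <- as.character(Patient_SSN)
lookup <- character(0)  # named vector: original SSN -> new ID

for (ssn in unique(ssn_chr)) {
  candidate <- random_digits(nchar(ssn))
  # Redraw while the candidate is already in use or equals a real SSN
  while (candidate %in% c(lookup, ssn_chr)) {
    candidate <- random_digits(nchar(ssn))
  }
  lookup[ssn] <- candidate
}

# Map every visit row to the ID of its patient
new_id <- unname(lookup[ssn_chr])
```

The IDs are kept as character strings because a leading zero (as in 04734) would be lost if they were converted to numbers.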
Skaqqs

One way to do it, if you want to keep the length of Patient_SSN, is to generate a random number between 0 and 1 and multiply it by 10^(length_of_number).

This won't guarantee that the IDs are unique, so you would need to check for duplicates and generate new numbers if any occur, but that is unlikely.

library(dplyr)
data_hospital <- data_hospital %>% mutate(id_length = nchar(Patient_SSN))
data_hospital$random_number <- runif(n = nrow(data_hospital),min = 0, max = 1)
data_hospital <- data_hospital %>% mutate(new_id = round(random_number*10^id_length))
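Note that the code above draws one random number per row, so repeat visits of the same patient would end up with different IDs, and rounding can yield a number with fewer digits than the original. A possible variant, sketched in base R under the assumption that the SSNs are positive whole numbers, draws once per unique SSN from the range of numbers with the same digit count:

```r
set.seed(1)  # only so that the sketch is reproducible

Patient_SSN <- c(1234, 43251, 9320, 2901, 3229, 4291, 9320, 9218988, 43251, 2901)

unique_ssn <- unique(Patient_SSN)
id_length  <- floor(log10(unique_ssn)) + 1  # number of digits per SSN

# Draw from [10^(len - 1), 10^len - 1] so the first digit is never zero
lower <- 10^(id_length - 1)
upper <- 10^id_length - 1
new_unique <- floor(runif(length(unique_ssn), min = lower, max = upper + 1))

# Rows that share an SSN pick up the same new ID
new_id <- new_unique[match(Patient_SSN, unique_ssn)]
```

Uniqueness is still not guaranteed here, so a check such as anyDuplicated(new_unique) remains advisable before the IDs are used.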
Tjn25
  • This method works but has a scientific issue. Randomization means you remove the trace back to the original. By keeping the length of the original, one could trace back. Especially when using patient data, this would not be accepted in Ethnic Health Assessment. – Bloxx Oct 13 '21 at 22:39
  • @Bloxx Is there a way of avoiding the traceback and keeping the original length? This was specified in the question. – Tjn25 Oct 13 '21 at 22:52
  • Sure, in the way that you do not get any feature from the original variable. I think that sample() does the trick... Try the code I posted in the comment under your original question. You can change the number in the sample function... In that case the only "feature" from the original SSN is the uniqueness, so that each unique value is assigned a random number. – Bloxx Oct 14 '21 at 12:21
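The sample() idea from the comments can be sketched in base R as follows. Sampling without replacement guarantees distinct codes, and using one fixed width for everyone avoids carrying over the length of the original SSN. The five-digit pool is an assumption chosen here because it holds 90,000 values, enough for roughly 10,000 patients.

```r
set.seed(2021)  # only so that the sketch is reproducible

Patient_SSN <- c(1234, 43251, 9320, 2901, 3229, 4291, 9320, 9218988, 43251, 2901)

unique_ssn <- unique(Patient_SSN)

# One distinct five-digit code per patient (no replacement, so no duplicates)
codes <- sample(10000:99999, length(unique_ssn))

# Every visit row receives the code of its patient
NEW_ID <- codes[match(Patient_SSN, unique_ssn)]
```

The only property carried over from the original column is which rows belong to the same patient.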