Assign a letter to each name in R to deidentify a dataset

Question

I have seen many posts regarding using the letters function in R to assign a letter ID to a row number, however my question is a bit different from that- I am looking to deidentify a dataset and would like to assign a letter to each clinician, some of whom appear in the dataset multiple times, and they have that same letter through the dataset.

For example, I want to make a column called "letter" that looks like this:

How would I add on that second letter column based on names from the first column? Here is some dput to play around with:

DF<-data.frame(ProviderID=c("Harry Potter","Hermione Granger","Ron Weasley", "Harry Potter", "Hermione Granger"))

Thank you!

score 1 · Accepted Answer · answered Aug 12 '23 at 14:41

The biggest thing to guard against using letters is if you have more than 26 providers. Options: if there are 52 or fewer providers, then c(LETTERS, letters) may suffice; add another 10 with c(LETTERS, letters, 0:9); or go with num2alpha for an arbitrary number-to-multi-letters solution.

num2alpha <- function(num, chr = letters, zero = "", sep = "") {
  len <- length(chr)
  stopifnot(len > 1)
  signs <- ifelse(!is.na(num) & sign(num) < 0, "-", "")
  num <- as.integer(abs(num))
  is0 <- !is.na(num) & num < 1e-9
  # num[num < 1] <- NA
  out <- character(length(num))
  mult <- 0
  while (any(!is.na(num) & num > 0)) {
    not0 <- !is.na(num) & num > 0
    out[not0] <- paste0(chr[(num[not0] - 1) %% len + 1], sep, out[not0])
    num[not0] <- (num[not0] - 1) %/% len
  }
  if (nzchar(sep)) out <- sub(paste0(sep, "$"), "", out)
  out[is0] <- zero
  out[is.na(num)] <- NA
  out[!is.na(out)] <- paste0(signs[!is.na(out)], out[!is.na(out)])
  out
}

data.table

library(data.table)
as.data.table(DF)[, letter := num2alpha(match(ProviderID, unique(ProviderID)))]
#          ProviderID letter
#              <char> <char>
# 1:     Harry Potter      a
# 2: Hermione Granger      b
# 3:      Ron Weasley      c
# 4:     Harry Potter      a
# 5: Hermione Granger      b

You can use upper-case with chr=:

as.data.table(DF)[, letter := num2alpha(match(ProviderID, unique(ProviderID)), chr = LETTERS)]
#          ProviderID letter
#              <char> <char>
# 1:     Harry Potter      A
# 2: Hermione Granger      B
# 3:      Ron Weasley      C
# 4:     Harry Potter      A
# 5: Hermione Granger      B

I think the use of rleid here is ill-advised, as Harry Potter on non-consecutive rows will present different letters.

dplyr

library(dplyr)
DF %>%
  mutate(letter = num2alpha(match(ProviderID, unique(ProviderID))))

base R

DF$letter <- num2alpha(match(DF$ProviderID, unique(DF$ProviderID)))

score 0 · Answer 2 · answered Aug 11 '23 at 22:14

Please try the below code

library(tidyverse)

DF %>% mutate(row=row_number()) %>% arrange(ProviderID) %>% 
  mutate(ids=data.table::rleid(ProviderID),
         letters=LETTERS[ids]) %>% 
  arrange(row) %>% select(-c(row,ids))

^{Created on 2023-08-11 with reprex v2.0.2}

        ProviderID letters
1     Harry Potter       A
2 Hermione Granger       B
3      Ron Weasley       C
4     Harry Potter       A
5 Hermione Granger       B

score 0 · Answer 3 · answered Aug 11 '23 at 22:14

0

Try match

> transform(DF,letter = LETTERS[match(ProviderID, unique(ProviderID))])
        ProviderID letter
1     Harry Potter      A
2 Hermione Granger      B
3      Ron Weasley      C
4     Harry Potter      A
5 Hermione Granger      B

answered Aug 11 '23 at 22:14

ThomasIsCoding

96,636
9
24
81

score 0 · Answer 4 · answered Aug 13 '23 at 12:15

Using the forcats package from the tidyverse, you can assign numbers to the names as a first step. You could then replace them by letters – or by combinations of letters, similar to what r2evans suggested.

If you can rule out that there are more than 26 distinct names in your data, you can simply use LETTERS instead of c(LETTERS, letters, paste0(LETTERS, letters)) below.

library(forcats)

DF<-data.frame(ProviderID=c("Harry Potter","Hermione Granger","Ron Weasley", "Harry Potter", "Hermione Granger"))

# Count distinct names:
length(unique(DF$ProviderID))
#> [1] 3

# Assign a number to each name:
DF$Letter <- DF$ProviderID %>% 
    fct_anon() %>% 
    as.character()

# Replace the numbers by letters:
for (i in seq_along(DF$Letter)) {
    DF$Letter[i] <- c(LETTERS, letters, paste0(LETTERS, letters))[i]
}

DF
#>         ProviderID Letter
#> 1     Harry Potter      A
#> 2 Hermione Granger      B
#> 3      Ron Weasley      C
#> 4     Harry Potter      D
#> 5 Hermione Granger      E

^{Created on 2023-08-13 with reprex v2.0.2}

KJG · Answer 5 · 2023-08-14T17:51:11.983

r2evans gave a variety of good options in different packages. You could also do this with data.table without a writing a new function:

library(data.table)
DT<-data.table(DF)
IDlist<-DT[, .(ProviderID =unique(ProviderID), Letters=(LETTERS[1:length(unique(ProviderID))])) ]
DT<-IDlist[DT, on="ProviderID"] #   #Adds the Coded ID

to give:

> DT
         ProviderID Letters
1:     Harry Potter       A
2: Hermione Granger       B
3:      Ron Weasley       C
4:     Harry Potter       A
5: Hermione Granger       B

If the intent is to code the provider ID for blinding/masking, a few more lines will do this, too. With a couple of data columns added as an example:

require(data.table)
set.seed(6502)
DT<-data.table(DF)
DT[ , `:=`(Data1=runif(5, 20, 50), Data2=runif(5, 100,500))]    #  Add some data to your table
#   Make a table linking each provider with a randomly assigned letter.  This can be saved for later use.
IDlist<-DT[, .(ProviderID =unique(ProviderID), CodedID=(sample(LETTERS[1:length(unique(ProviderID))], length(unique(ProviderID))))) ]
DT<-IDlist[DT, on="ProviderID"] #   #Adds the Coded ID
DT[, ProviderID:=NULL]          #  Remove unblinded ID

Yielding:

> DT
   CodedID    Data1    Data2
1:       A 38.71399 178.8384
2:       C 45.32531 434.8654
3:       B 32.67079 420.2604
4:       A 22.49274 376.1324
5:       C 37.50478 369.3482

fyi, don't use `require` this way, see https://stackoverflow.com/a/51263513/3358272, https://yihui.org/en/2014/07/library-vs-require/, https://r-pkgs.org/namespace.html#search-path. Use `library(data.table)` or `if (!require(data.table)) { do_something_here(); }`. — r2evans, Aug 13 '23 at 23:35
Thanks for pointing this out with the explanatory link. I've updated the code block. — KJG, Aug 14 '23 at 17:54

Assign a letter to each name in R to deidentify a dataset

5 Answers5

data.table

dplyr

base R