0

I have seen many posts regarding using the letters function in R to assign a letter ID to a row number, however my question is a bit different from that- I am looking to deidentify a dataset and would like to assign a letter to each clinician, some of whom appear in the dataset multiple times, and they have that same letter through the dataset.

For example, I want to make a column called "letter" that looks like this:

enter image description here

How would I add on that second letter column based on names from the first column? Here is some dput to play around with:

DF<-data.frame(ProviderID=c("Harry Potter","Hermione Granger","Ron Weasley", "Harry Potter", "Hermione Granger"))

Thank you!

5 Answers5

1

The biggest thing to guard against using letters is if you have more than 26 providers. Options: if there are 52 or fewer providers, then c(LETTERS, letters) may suffice; add another 10 with c(LETTERS, letters, 0:9); or go with num2alpha for an arbitrary number-to-multi-letters solution.

num2alpha <- function(num, chr = letters, zero = "", sep = "") {
  len <- length(chr)
  stopifnot(len > 1)
  signs <- ifelse(!is.na(num) & sign(num) < 0, "-", "")
  num <- as.integer(abs(num))
  is0 <- !is.na(num) & num < 1e-9
  # num[num < 1] <- NA
  out <- character(length(num))
  mult <- 0
  while (any(!is.na(num) & num > 0)) {
    not0 <- !is.na(num) & num > 0
    out[not0] <- paste0(chr[(num[not0] - 1) %% len + 1], sep, out[not0])
    num[not0] <- (num[not0] - 1) %/% len
  }
  if (nzchar(sep)) out <- sub(paste0(sep, "$"), "", out)
  out[is0] <- zero
  out[is.na(num)] <- NA
  out[!is.na(out)] <- paste0(signs[!is.na(out)], out[!is.na(out)])
  out
}

data.table

library(data.table)
as.data.table(DF)[, letter := num2alpha(match(ProviderID, unique(ProviderID)))]
#          ProviderID letter
#              <char> <char>
# 1:     Harry Potter      a
# 2: Hermione Granger      b
# 3:      Ron Weasley      c
# 4:     Harry Potter      a
# 5: Hermione Granger      b

You can use upper-case with chr=:

as.data.table(DF)[, letter := num2alpha(match(ProviderID, unique(ProviderID)), chr = LETTERS)]
#          ProviderID letter
#              <char> <char>
# 1:     Harry Potter      A
# 2: Hermione Granger      B
# 3:      Ron Weasley      C
# 4:     Harry Potter      A
# 5: Hermione Granger      B

I think the use of rleid here is ill-advised, as Harry Potter on non-consecutive rows will present different letters.

dplyr

library(dplyr)
DF %>%
  mutate(letter = num2alpha(match(ProviderID, unique(ProviderID))))

base R

DF$letter <- num2alpha(match(DF$ProviderID, unique(DF$ProviderID)))
r2evans
  • 141,215
  • 6
  • 77
  • 149
0

Please try the below code

library(tidyverse)

DF %>% mutate(row=row_number()) %>% arrange(ProviderID) %>% 
  mutate(ids=data.table::rleid(ProviderID),
         letters=LETTERS[ids]) %>% 
  arrange(row) %>% select(-c(row,ids))

Created on 2023-08-11 with reprex v2.0.2

        ProviderID letters
1     Harry Potter       A
2 Hermione Granger       B
3      Ron Weasley       C
4     Harry Potter       A
5 Hermione Granger       B
jkatam
  • 2,691
  • 1
  • 4
  • 12
0

Try match

> transform(DF,letter = LETTERS[match(ProviderID, unique(ProviderID))])
        ProviderID letter
1     Harry Potter      A
2 Hermione Granger      B
3      Ron Weasley      C
4     Harry Potter      A
5 Hermione Granger      B
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
0

Using the forcats package from the tidyverse, you can assign numbers to the names as a first step. You could then replace them by letters – or by combinations of letters, similar to what r2evans suggested.

If you can rule out that there are more than 26 distinct names in your data, you can simply use LETTERS instead of c(LETTERS, letters, paste0(LETTERS, letters)) below.

library(forcats)

DF<-data.frame(ProviderID=c("Harry Potter","Hermione Granger","Ron Weasley", "Harry Potter", "Hermione Granger"))

# Count distinct names:
length(unique(DF$ProviderID))
#> [1] 3

# Assign a number to each name:
DF$Letter <- DF$ProviderID %>% 
    fct_anon() %>% 
    as.character()

# Replace the numbers by letters:
for (i in seq_along(DF$Letter)) {
    DF$Letter[i] <- c(LETTERS, letters, paste0(LETTERS, letters))[i]
}

DF
#>         ProviderID Letter
#> 1     Harry Potter      A
#> 2 Hermione Granger      B
#> 3      Ron Weasley      C
#> 4     Harry Potter      D
#> 5 Hermione Granger      E

Created on 2023-08-13 with reprex v2.0.2

lhdjung
  • 1
  • 2
0

r2evans gave a variety of good options in different packages. You could also do this with data.table without a writing a new function:

library(data.table)
DT<-data.table(DF)
IDlist<-DT[, .(ProviderID =unique(ProviderID), Letters=(LETTERS[1:length(unique(ProviderID))])) ]
DT<-IDlist[DT, on="ProviderID"] #   #Adds the Coded ID

to give:

> DT
         ProviderID Letters
1:     Harry Potter       A
2: Hermione Granger       B
3:      Ron Weasley       C
4:     Harry Potter       A
5: Hermione Granger       B

If the intent is to code the provider ID for blinding/masking, a few more lines will do this, too. With a couple of data columns added as an example:

require(data.table)
set.seed(6502)
DT<-data.table(DF)
DT[ , `:=`(Data1=runif(5, 20, 50), Data2=runif(5, 100,500))]    #  Add some data to your table
#   Make a table linking each provider with a randomly assigned letter.  This can be saved for later use.
IDlist<-DT[, .(ProviderID =unique(ProviderID), CodedID=(sample(LETTERS[1:length(unique(ProviderID))], length(unique(ProviderID))))) ]
DT<-IDlist[DT, on="ProviderID"] #   #Adds the Coded ID
DT[, ProviderID:=NULL]          #  Remove unblinded ID

Yielding:

> DT
   CodedID    Data1    Data2
1:       A 38.71399 178.8384
2:       C 45.32531 434.8654
3:       B 32.67079 420.2604
4:       A 22.49274 376.1324
5:       C 37.50478 369.3482
KJG
  • 23
  • 5
  • fyi, don't use `require` this way, see https://stackoverflow.com/a/51263513/3358272, https://yihui.org/en/2014/07/library-vs-require/, https://r-pkgs.org/namespace.html#search-path. Use `library(data.table)` or `if (!require(data.table)) { do_something_here(); }`. – r2evans Aug 13 '23 at 23:35
  • 1
    Thanks for pointing this out with the explanatory link. I've updated the code block. – KJG Aug 14 '23 at 17:54