Efficient way to fill column with numbers that identify observations with same value in column

Question

I apologize for the wording of the question and any errors. I'm new to SO and to R.

Problem: Find an efficient way to fill a column with numbers that uniquely identify observations sharing the same value in another column. The result would look like this:

    patient_number id
1              46  1
2              47  2
3              15  3
4              42  4
5              33  5
6              26  6
7              37  7
8               7  8
9              33  5
10             36  9

Sample data frame

set.seed(42)
df <- data.frame(
  patient_number = sample(seq(1, 50, 1), 100, replace = TRUE)
)

What I was able to come up with

df$id <- NA  ## create id column filled with NA to make the if statement easier
n_unique <- length(unique(df$patient_number))  ## number of unique patient numbers

for (i in seq_len(nrow(df))) {
  ## indices of all observations with the same patient_number as row i
  index_identical <- which(df$patient_number == df$patient_number[i])

  ## if these observations have no id yet, assign the smallest integer
  ## between 1 and n_unique that is not already in use
  if (any(is.na(df$id[index_identical]))) {
    df$id[index_identical] <- setdiff(seq(1, n_unique, 1), df$id)[1]
  }
}

It does the job, but with thousands of rows, it takes time.

Thanks for bearing with me.

Try `with(df, match(patient_number, unique(patient_number)))` or `with(df, as.integer(factor(patient_number, levels = unique(patient_number))))` or `library(data.table);setDT(df)[, id := .GRP, patient_number]` — akrun, Feb 20 '19 at 15:23

score 6 · Answer 1 · answered Feb 20 '19 at 15:27

If you're open to other packages, you can use the `group_indices()` function from the dplyr package:

library(dplyr)
df %>%
  mutate(id = group_indices(., patient_number))

    patient_number id
1               46 40
2               47 41
3               15 14
4               42 37
5               33 28
6               26 23
7               37 32
8                7  6
9               33 28
10              36 31
11              23 21
12              36 31
13              47 41
...
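
The ids in the output above follow the sorted order of patient_number (7 gets 6, 15 gets 14), whereas the question's expected output numbers the groups by order of first appearance. The difference can be sketched in base R using the first ten values from the question. The names `x`, `id_sorted` and `id_first` are illustrative only, and `id_sorted` will not match the numbers above because only ten of the 100 rows are used here. As far as I know, in dplyr 1.0.0 and later the preferred spelling is `group_by()` followed by `cur_group_id()`, which also numbers groups in sorted order.

```r
# First ten patient numbers from the question's expected output
x <- c(46, 47, 15, 42, 33, 26, 37, 7, 33, 36)

# Ids numbered by sorted value: factor() sorts its levels by default
id_sorted <- as.integer(factor(x))

# Ids numbered by order of first appearance
id_first <- match(x, unique(x))

print(data.frame(patient_number = x, id_sorted, id_first))
```

Either numbering identifies the groups uniquely, so the choice only matters if the id values themselves need to reflect row order.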

Thank you. It does the job well. I didn't know that function from dplyr. — Pablo Rod, Feb 20 '19 at 15:48

score 5 · Accepted Answer · answered Feb 20 '19 at 15:28

We can use .GRP from data.table, which numbers the groups in order of first appearance:

library(data.table)
setDT(df)[, id := .GRP, patient_number]

The base R match and factor options are fast as well:

df$id <- with(df, match(patient_number, unique(patient_number)))
df$id <- with(df, as.integer(factor(patient_number, 
               levels = unique(patient_number))))
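
As a quick sanity check, the two base R options can be compared on a larger data set. This is a minimal sketch, assuming a hypothetical 100,000-row version of the question's sample data. The names `id_match` and `id_factor` are illustrative, and the timing will vary by machine.

```r
set.seed(42)
# Hypothetical larger data set: 1e5 rows, same value range as the question
df <- data.frame(patient_number = sample(1:50, 1e5, replace = TRUE))

id_match  <- with(df, match(patient_number, unique(patient_number)))
id_factor <- with(df, as.integer(factor(patient_number,
                                        levels = unique(patient_number))))

# Both approaches should give exactly the same integer ids
stopifnot(identical(id_match, id_factor))

# There are as many distinct ids as distinct patient numbers,
# and as many distinct (patient_number, id) pairs, so the mapping is one-to-one
n_groups <- length(unique(df$patient_number))
stopifnot(length(unique(id_match)) == n_groups)
stopifnot(nrow(unique(data.frame(df$patient_number, id_match))) == n_groups)

# Rough timing of the vectorized match() approach
print(system.time(match(df$patient_number, unique(df$patient_number))))
```

Both options avoid the row-by-row loop, which scans the whole column once per row.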

answered by akrun

Thank you, akrun. All alternatives work well. The match solution is a good twist to the function utility, just by using base R. – Pablo Rod Feb 20 '19 at 15:45
