Assign unique ID based on two columns

Question

I have a dataframe (df) that looks like this:

School Student  Year  
A         10    1999
A         10    2000
A         20    1999
A         20    2000
A         20    2001
B         10    1999
B         10    2000

And I would like to create a person ID column so that df looks like this:

ID School Student  Year  
1   A         10    1999
1   A         10    2000
2   A         20    1999
2   A         20    2000
2   A         20    2001
3   B         10    1999
3   B         10    2000

In other words, the ID variable indicates which person it is in the dataset, accounting for both Student number and School membership (here we have 3 students total).

I did df$ID <- df$Student and then tried to increase the value by 1 whenever the c("School", "Student") combination was unique. It isn't working. Help appreciated.

`as.numeric(factor(paste0(df$School, df$Student)))` – Ronak Shah Mar 21 '17 at 08:31

akrun · Accepted Answer · 2019-03-20T05:56:22.230

30

We can do this in base R without any group-by operation:

df$ID <- cumsum(!duplicated(df[1:2]))
df
#   School Student Year ID
#1      A      10 1999  1
#2      A      10 2000  1
#3      A      20 1999  2
#4      A      20 2000  2
#5      A      20 2001  2
#6      B      10 1999  3
#7      B      10 2000  3

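NOTE: This assumes that rows with the same 'School' and 'Student' are adjacent (for example, the data is sorted by those two columns). If a combination reappears later in the data, it will not get its original ID back.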
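To illustrate the note above, here is a minimal sketch of what happens when the rows are not ordered, and one possible base R alternative using `match()` on a combined key. The data frame `df2`, the column `ID_cumsum`, and the `"_"` separator are made up for illustration; the separator is assumed not to occur in the values themselves.

```r
# Hypothetical unsorted data: rows for the same School/Student are not adjacent
df2 <- data.frame(
  School  = c("A", "B", "A", "A", "B"),
  Student = c(10, 10, 20, 10, 10),
  Year    = c(1999, 1999, 1999, 2000, 2000)
)

# cumsum(!duplicated()) only increments on the first appearance of a pair,
# so a pair that reappears later inherits the running count instead of its own ID
df2$ID_cumsum <- cumsum(!duplicated(df2[1:2]))
df2$ID_cumsum
# [1] 1 2 3 3 3

# Alternative: build one key per row, then number the keys by first appearance
# (assumes "_" never occurs inside School or Student)
key <- paste(df2$School, df2$Student, sep = "_")
df2$ID <- match(key, unique(key))
df2$ID
# [1] 1 2 3 1 2
```

Sorting first, e.g. `df2[order(df2$School, df2$Student), ]`, and then applying `cumsum(!duplicated(...))` should also work if reordering the rows is acceptable.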

Or using tidyverse

library(dplyr)
df %>% 
    mutate(ID = group_indices_(df, .dots=c("School", "Student"))) 
#  School Student Year ID
#1      A      10 1999  1
#2      A      10 2000  1
#3      A      20 1999  2
#4      A      20 2000  2
#5      A      20 2001  2
#6      B      10 1999  3
#7      B      10 2000  3

As @radek mentioned, recent versions of dplyr (0.8.0 onwards) give a notification that group_indices_ is deprecated; use group_indices instead:

df %>% 
   mutate(ID = group_indices(., School, Student))
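For newer dplyr releases, a sketch along the following lines may be preferable. It assumes dplyr 1.0.0 or later, where, as far as I know, calling `group_indices()` with grouping columns was in turn deprecated in favour of `cur_group_id()`. The object name `df_id` is just for illustration, and the data is rebuilt here so that the snippet runs on its own.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  School  = c("A", "A", "A", "A", "A", "B", "B"),
  Student = c(10, 10, 20, 20, 20, 10, 10),
  Year    = c(1999, 2000, 1999, 2000, 2001, 1999, 2000)
)

# cur_group_id() returns the index of the current group inside mutate();
# ungroup() drops the grouping again so later steps are not affected
df_id <- df %>%
  group_by(School, Student) %>%
  mutate(ID = cur_group_id()) %>%
  ungroup()

df_id$ID
# [1] 1 1 2 2 2 3 3
```

Unlike the `cumsum(!duplicated())` approach, this does not depend on the row order, since the IDs follow the sorted group keys.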

edited Mar 20 '19 at 05:56

answered Mar 21 '17 at 08:28

akrun


2

I did the first one but had to write it as cumsum(!duplicated(df$1,df$2)) to get it to work. Thanks! – iPlexpen Mar 21 '17 at 17:40
1

@Quixotic The `duplicated` works on a vector or data.frame/matrix, but if you use two vectors as arguments, it may not work – akrun Mar 21 '17 at 17:42
2

`group_indices_()` is deprecated. Should be now `mutate(ID = group_indices(df, School, Student))`? – radek Mar 20 '19 at 05:53

score 12 · Answer 2 · answered Mar 21 '17 at 08:27

Group by School and Student, then assign group id to ID variable.

library('data.table')
df[, ID := .GRP, by = .(School, Student)]

#    School Student Year ID
# 1:      A      10 1999  1
# 2:      A      10 2000  1
# 3:      A      20 1999  2
# 4:      A      20 2000  2
# 5:      A      20 2001  2
# 6:      B      10 1999  3
# 7:      B      10 2000  3
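Note that the `:=` syntax only works on a data.table. If `df` is still a plain data.frame, a minimal sketch (assuming it is fine to convert it in place with `setDT()`) could look like this:

```r
library(data.table)

# Example data from the question, as an ordinary data.frame
df <- data.frame(
  School  = c("A", "A", "A", "A", "A", "B", "B"),
  Student = c(10, 10, 20, 20, 20, 10, 10),
  Year    = c(1999, 2000, 1999, 2000, 2001, 1999, 2000)
)

# Convert to a data.table by reference (no copy is made)
setDT(df)

# .GRP is a counter for the current group, numbered in order of first appearance
df[, ID := .GRP, by = .(School, Student)]

df$ID
# [1] 1 1 2 2 2 3 3
```

Because `.GRP` numbers groups by first appearance, the rows do not need to be sorted for a combination to keep the same ID throughout.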

Data:

df <- fread('School Student Year
A 10 1999
A 10 2000
A 20 1999
A 20 2000
A 20 2001
B 10 1999
B 10 2000')
