I need to create a unique identifier from the combination of two variables in a data frame. Consider the following data frame:

 df <- data.frame(col1 = c("a", "a", "b", "c"), col2 = c("c", "b", "c", "a"), id = c(1,2,3,1))

The variable "id" is not in the data set; it is the one I would like to create. Essentially, the order of col1 and col2 within a row should not matter, e.g. the combination c("a", "c") should get the same id as c("c", "a").

markus
crubba
  • Something like this could be of help: http://stackoverflow.com/questions/25297812/pair-wise-duplicate-removal-from-dataframe/25298863 – thelatemail Mar 07 '16 at 00:35

2 Answers

You could do:

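# Sort the two values within each row; the result is a 2 x n matrix with one column per row of df
labels <- apply(df[, c("col1", "col2")], 1, sort)
# Paste each sorted pair into a single key and number the distinct keys
df$id <- as.numeric(factor(apply(labels, 2, function(x) paste(x, collapse=""))))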
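One caveat with `collapse=""`: when values have more than one character, different pairs can paste to the same string (for example "ab", "c" and "a", "bc"). A minimal sketch of a possible workaround, assuming a separator such as "|" that does not occur in the data; `df_multi` is a hypothetical example, not the data from the question:

```r
# Hypothetical data with multi-character values (not from the question)
df_multi <- data.frame(col1 = c("ab", "a"), col2 = c("c", "bc"),
                       stringsAsFactors = FALSE)

# Sort each row; the result is a 2 x n matrix with one column per row of df_multi
labels <- apply(df_multi[, c("col1", "col2")], 1, sort)

# Without a separator both rows collapse to the same key, "abc"
key_nosep <- apply(labels, 2, paste, collapse = "")

# With a separator (assumed absent from the data) the two pairs stay distinct
key_sep <- apply(labels, 2, paste, collapse = "|")

id_nosep <- as.numeric(factor(key_nosep))
id_sep   <- as.numeric(factor(key_sep))
```

Here `id_nosep` assigns both rows the same id, while `id_sep` keeps them apart.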
thelatemail
griverorz

Below, the first expression loops over each row with apply. The second is a more complicated version that uses the vectorized pmin and pmax and runs much faster.

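# pmin/pmax are not meaningful for factor columns, so make sure both columns are character
sel <- c("col1","col2")
df[sel] <- lapply(df[sel], as.character)

# Row-wise version: sort each pair and collapse it into a single key
as.numeric(factor(apply(df[1:2], 1, function(x) toString(sort(x)) )))
#[1] 2 1 3 2

# Vectorized version: pmin/pmax give the smaller and larger value of each pair
as.numeric(interaction(list(do.call(pmin,df[1:2]),do.call(pmax,df[1:2])),drop=TRUE))
#[1] 2 1 3 2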
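
To see why the pmin/pmax version works, it can help to look at the intermediate values. The sketch below rebuilds the example data with `stringsAsFactors = FALSE`, so no conversion is needed. It also shows one possible way to number the groups by first appearance, which reproduces the `id` column from the question. The names `lo`, `hi`, and `key` are illustrative, and the "|" separator is assumed not to occur in the data:

```r
df <- data.frame(col1 = c("a", "a", "b", "c"),
                 col2 = c("c", "b", "c", "a"),
                 stringsAsFactors = FALSE)

# Element-wise smaller and larger value of each pair
lo <- pmin(df$col1, df$col2)   # "a" "a" "b" "a"
hi <- pmax(df$col1, df$col2)   # "c" "b" "c" "c"

# An order-independent key for each row
key <- paste(lo, hi, sep = "|")

# Number the groups in order of first appearance
df$id <- match(key, unique(key))
df$id
#[1] 1 2 3 1
```

The numbering differs from the factor-based versions above (which follow the sort order of the levels), but the grouping is the same.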

Benchmarking on 1M rows:

df2 <- df[rep(1:4, each=2.5e5),]

system.time(as.numeric(factor(apply(df2[1:2], 1, function(x) toString(sort(x)) ))))
#   user  system elapsed 
#  69.21    0.08   69.41 

system.time(as.numeric(interaction(list(do.call(pmin,df2[1:2]),do.call(pmax,df2[1:2])),drop=TRUE)))
#   user  system elapsed 
#   0.88    0.03    0.91 
thelatemail