I have a data frame (df) or data table (dt) with, let's say, 1000 variables and 1000 observations. I checked that there are no duplicated observations, so dt[!duplicated(dt)] has the same number of rows as the original data.
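A minimal sketch of that duplicate check, using mtcars as a stand-in for the real data (the names dt, no_dups and same_rows are only illustrative):

```r
library(data.table)

dt <- as.data.table(mtcars)  # stand-in for the real 1000 x 1000 table

# anyDuplicated() returns 0 when no row repeats an earlier row
no_dups <- anyDuplicated(dt) == 0L

# Equivalent check: dropping duplicated rows leaves the row count unchanged
same_rows <- nrow(dt[!duplicated(dt)]) == nrow(dt)

no_dups
same_rows
```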
I would like to create an ID variable for these observations from a combination of some of the 1000 variables I have. Unlike in other SO questions, I don't know which variables are most suitable for building the ID, and I will likely need a combination of at least 3 or 4 variables.
Is there any package or function in R that could find the most efficient combination of variables to use as an ID? In my real data I am struggling to build an ID manually, and the combination I end up with is probably not the best one.
Example with mtcars:
library(data.table)
example <- data.table(mtcars) # data.table() already drops the mtcars row names
example <- example[!duplicated(example), ]
example[, id_var_wrong := paste0(mpg, "_", cyl)]
length(unique(example$id_var_wrong)) # Wrong ID: only 27 distinct values for 32 observations
example[, id_var_good := paste0(wt, "_", qsec)]
length(unique(example$id_var_good)) # Good ID: as many distinct values as observations
Is there any function to find wt and qsec automatically rather than manually?