
I have a data frame (df) or data table (dt) with, let's say, 1000 variables and 1000 observations. I have checked that there are no duplicated observations, so dt[!duplicated(dt)] has the same number of rows as the original table.

I would like to create an ID variable for all these observations from a combination of some of the 1000 variables. Unlike other SO questions, I don't know in advance which variables are most suitable for building the ID, and it is likely that I need a combination of at least 3 or 4 of them.

Is there any package or function in R that can find the most efficient combination of variables to create an ID variable? In my real example I am struggling to build an ID manually, and the combination I end up with is probably not the best one.

Example with mtcars:

require(data.table)
example <- data.table(mtcars)
rownames(example) <- NULL # Delete mtcars row names
example <- example[!duplicated(example),]
example[,id_var_wrong := paste0(mpg,"_",cyl)]
length(unique(example$id_var_wrong)) # Wrong ID: only 27 distinct values for 32 observations

example[,id_var_good := paste0(wt,"_",qsec)]
length(unique(example$id_var_good)) # Good ID: the number of unique values equals the number of observations

Is there any function to find wt and qsec automatically and not manually?

user3507584
  • Possible duplicate of [Assign unique ID based on two columns](https://stackoverflow.com/questions/42921674/assign-unique-id-based-on-two-columns) – Florian Jul 25 '17 at 09:59
  • You are doing this on floating-point values, so that is part of the issue – akrun Jul 25 '17 at 10:03
  • Do IDs have to be human readable? If not, you could try the [uuid](https://cran.r-project.org/web/packages/uuid/index.html) package. – Roman Luštrik Jul 25 '17 at 10:08
  • How about `example[, id_var := paste(mpg, cyl, seq_len(.N), sep = "_"), by = .(mpg, cyl)]`? This will add an incrementing number to the id variables so that each row has a unique id – talat Jul 25 '17 at 10:13
  • @docendodiscimus The key is that you know mpg and cyl are good candidate variables for an ID because you have looked at them. But imagine creating an ID variable from a combination of variables when you cannot review all of them by hand. Is there any algorithm or package for that, i.e., to find the minimum number of variables which, combined, create an ID variable? – user3507584 Jul 25 '17 at 10:36
  • Don't you have a natural key to identify a single observation? If not, create an artificial key, e.g., by numbering the rows consecutively. It's always a risk to use attributes which _may_ become identical, e.g., through rounding. – Uwe Jul 25 '17 at 10:53
  • I don't see the purpose of efficiency here. If there's a column where different values mean the observations aren't the same "thing", that's one of the ID columns. As @uwe-block answered, omitting it for "efficiency" is asking for trouble. – Nathan Werth Jul 25 '17 at 15:55
  • I'm not sure if my use case is the same as the OP's, but if so I can explain why efficiency is important and why a key based on row numbers won't work. I have an original data source with x rows and y variables and no unique key. I have taken a subset of it with n rows and m variables (n < x, m < y). At a later date I need additional variables which are in the original but not in the subset, and it is impractical to create a new subset. I need the best way to create a unique ID which can link the original to the subset. – Moohan Apr 01 '19 at 16:08

3 Answers


A homemade greedy algorithm: repeatedly take the variable with the largest number of distinct values, keep only the rows that are still duplicated given the variables chosen so far, and iterate on those. This doesn't guarantee the smallest possible set of variables, but it's an easy way to get a rather good solution quickly (a brute-force comparison is sketched after the output below).

set.seed(1)
# simulate 100 observations of 1000 character variables
mat <- replicate(1000, sample(c(letters, LETTERS), size = 100, replace = TRUE))

library(dplyr)

columnsID <- function(mat) {
  df <- df0 <- as_data_frame(mat)
  vars <- c()
  while (nrow(df) > 0) {
    # pick the variable with the largest number of distinct values among the remaining rows
    var_best <- names(which.max(lapply(df, n_distinct)))[[1]]
    vars <- append(vars, var_best)
    # keep only the rows that are still not uniquely identified by the variables chosen so far
    df <- group_by_at(df0, vars) %>% filter(n() > 1)
  }
  vars
}

columnsID(mat)
[1] "V68" "V32"
F. Privé

In many cases there is a natural key which uniquely identifies each observation. For instance, the mtcars data set has unique row names.

library(data.table)
data.table(mtcars, keep.rownames = "id")
                     id  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
 1:           Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
 2:       Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
 3:          Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
 4:      Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
 5:   Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
 6:             Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
 ...

If there is no natural key available, I suggest creating an artificial key by simply numbering the rows consecutively and storing it in an additional column:

data.table(mtcars)[, rn := .I][]
     mpg cyl  disp  hp drat    wt  qsec vs am gear carb rn
 1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4  1
 2: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  2
 3: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1  3
 4: 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1  4
 5: 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2  5
 6: 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1  6
 ...

Anything else is probably not worth the effort, or runs the risk that attribute values become identical, e.g., when they are rounded.
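
To see the rounding risk with the question's own example (my sketch, not part of the original answer): the wt/qsec combination that worked as an ID stops being unique as soon as the values are rounded.

library(data.table)
example <- data.table(mtcars)
# round the attributes before pasting them into an ID
example[, id_var_rounded := paste0(round(wt, 1), "_", round(qsec, 0))]
length(unique(example$id_var_rounded))  # fewer than nrow(example): some rows now collide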

Uwe

Based on @F. Privé's answer. You can optionally specify a startVar if you already have an idea of which variable might work best; otherwise the function starts with the variable that has the largest number of distinct values. A short usage sketch follows the output below.

library(dplyr)
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

columnsID <- function(dataset,
                      startVar = NULL,
                      frac = 1) {
  #Set up some temporary dataframes
  #Remove any total duplicates with distinct
  #Take a sample if not working on the full data
  tibb <- as_tibble(dataset) %>%
    distinct() %>%
    sample_frac(frac)

  #Set up the vars which will be used
  if (is.null(startVar)) {
    startVar <- names(which.max(lapply(tibb, n_distinct)))[[1]]
  }
  vars_agg <- c(startVar)
  vars_all <- names(tibb)

  #Filter out any rows which are already uniquely identified
  tibb <- tibb %>% group_by_at(vars(vars_agg)) %>%
    filter(n() > 1)

  while (nrow(tibb) > 0) {
    #Keep track of the vars we haven't used yet
    vars_unused <- setdiff(vars_all, vars_agg)

    #Find the variable which has the most distinct number of values on average
    #for the grouping we have so far
    var_best <-
      tibb %>%
      group_by(!!!syms(vars_agg)) %>%
      mutate_at(vars(vars_unused), funs(n_distinct(.))) %>%
      ungroup() %>%
      summarise_at(vars(vars_unused), funs(mean)) %>%
      which.max() %>%
      names()

    #Add the 'best variable' to the list
    vars_agg <- c(vars_agg, var_best)

    #Filter out any rows which are now uniquely identified
    tibb <- tibb %>%
      group_by_at(vars(vars_agg)) %>%
      filter(n() > 1)
  }
  vars_agg
}

columnsID(mtcars)
#> [1] "qsec" "mpg"

Created on 2019-04-02 by the reprex package (v0.2.1)
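
A brief usage sketch of the optional arguments (my addition): startVar seeds the search with a variable you already trust, and frac runs the search on a random sample of the rows, which only gives a heuristic answer for the full data.

columnsID(mtcars, startVar = "wt")  # force the search to start from wt
columnsID(mtcars, frac = 0.5)       # use a 50% sample of the de-duplicated rows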

Moohan