
How do I remove duplicate columns with data.table? (keeping just one of them)

I know there are other questions about duplicate columns, but they only check for duplicate column names, not duplicate content.

What I want is to find columns that have different names but the same content.

Regards

skan

1 Answer


This is a common task in feature engineering. The following code chunk was developed by me and the Kaggle community for just this purpose:

##### Removing identical features
features_pair <- combn(names(train), 2, simplify = FALSE) # list all column pairs
toRemove <- character(0) # init a vector to store the names of duplicate columns
for (pair in features_pair) { # put each pair into temp objects for testing
  f1 <- pair[1]
  f2 <- pair[2]

  if (!(f1 %in% toRemove) && !(f2 %in% toRemove)) {
    # identical() is NA-safe, unlike all(x == y), but it treats integer and double columns as different
    if (identical(train[[f1]], train[[f2]])) { # test for duplicates
      cat(f1, "and", f2, "are equal.\n")
      toRemove <- c(toRemove, f2) # build the list of duplicates
    }
  }
}

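Then you can drop whichever copy of each duplicate you want. By default I remove the copy stored in the temporary object f2, whose names are collected in toRemove. Because toRemove is a character vector, it has to be turned into a logical index first (for a plain data.frame, use drop = FALSE in place of with = FALSE):

train <- train[, !(names(train) %in% toRemove), with = FALSE]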
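
As a quick sanity check, here is a minimal sketch of the same pairwise comparison wrapped in a function and run on made-up data. The function name find_duplicate_columns and the toy table are assumptions made for illustration. The comparison uses identical(), so columns holding NA values can be compared, while an integer column and a double column with the same values are not reported as duplicates.

```r
library(data.table)

# Hypothetical helper (name assumed for illustration): returns the names of
# columns whose content repeats an earlier column.
find_duplicate_columns <- function(dt) {
  toRemove <- character(0)
  if (ncol(dt) < 2) return(toRemove)  # combn() needs at least two columns
  for (pair in combn(names(dt), 2, simplify = FALSE)) {
    f1 <- pair[1]
    f2 <- pair[2]
    if (!(f1 %in% toRemove) && !(f2 %in% toRemove) &&
        identical(dt[[f1]], dt[[f2]])) {
      toRemove <- c(toRemove, f2)  # keep f1, mark f2 for removal
    }
  }
  toRemove
}

# Hypothetical toy data: b repeats a (including the NA), d repeats c
train <- data.table(a = c(1, 2, NA), b = c(1, 2, NA),
                    c = c("x", "y", "z"), d = c("x", "y", "z"),
                    e = c(3, 2, 1))

toRemove <- find_duplicate_columns(train)
deduped  <- train[, !(names(train) %in% toRemove), with = FALSE]

print(toRemove)        # names of the columns flagged as duplicates
print(names(deduped))  # names of the columns that were kept
```

The number of pairs grows quadratically with the number of columns, so on very wide tables this is likely to be slower than a single duplicated() call on the list of columns.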
Hack-R
  • I would sure appreciate a comment to explain the downvote. This is one of my most useful and frequently used code chunks. – Hack-R Jul 06 '16 at 11:17
  • I didn't downvote your answer to the OP, but your code seems quite long when it could be done using `dat[!duplicated(as.list(dat))]` for a data.frame or `dat[, !duplicated(as.list(dat)), with = FALSE]` for a data.table – talat Jul 06 '16 at 11:23
  • Just curious, have you compared speed (using `data.table` ...) with the linked answer above (cf. @docendodiscimus's comment)? I would encapsulate the comparison function and use it as the `FUN` argument to `combn()`, but otherwise, nice answer. – Ben Bolker Jul 06 '16 at 11:24
  • I see a few reasons to downvote: you're growing something inside a loop, which probably doesn't hurt much in this case but is a bad example to give the OP; you write `!toRemove` which does not work, since `toRemove` is not a logical vector; and you use `F` instead of `FALSE` which is pretty minor (as all of these are), but not a good idea for code you're lending to someone else verbatim. Anyway, I like that the code is readable with minimal knowledge of R. – Frank Jul 06 '16 at 12:39
  • @docendodiscimus Why do you need to convert it to a list? – skan Jul 06 '16 at 22:00
  • @docendodiscimus Why not use `unique` instead of `!duplicated`, as in `temp[,unique(as.list(temp))]`? – skan Jul 08 '16 at 11:08
  • How can I find out not only whether a column is duplicated but also the name of the original one? – skan Jul 08 '16 at 11:56
  • @docendodiscimus Is it better to reassign the result to the variable your way, or to do this directly: `dat[, which(duplicated(names(dat))) := NULL]`? – skan Jul 18 '16 at 23:06
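
Two of the follow-up questions in the comments (why the list conversion is needed, and how to find the name of the original column) can be illustrated with a small sketch. The toy table and the variable names cols, is_dup and originals are assumptions made for illustration. The idea is that duplicated() called on the table itself compares rows, while duplicated() called on the list of columns compares whole columns.

```r
library(data.table)

# Hypothetical toy data: b repeats a, d repeats c
dat <- data.table(a = c(1, 2, 3), b = c(1, 2, 3),
                  c = c("x", "y", "z"), d = c("x", "y", "z"),
                  e = c(3, 2, 1))

cols   <- as.list(dat)      # one list element per column
is_dup <- duplicated(cols)  # one flag per column: TRUE if it repeats an earlier one

# Without as.list(), duplicated() works row-wise and returns one flag per row
row_flags <- duplicated(dat)

# For each duplicate column, look up the first column with identical content
originals <- vapply(which(is_dup), function(j) {
  same <- vapply(cols, identical, logical(1), cols[[j]])
  names(dat)[which(same)[1]]
}, character(1))
names(originals) <- names(dat)[is_dup]

print(originals)  # duplicate column name -> name of the column it repeats

# Drop the duplicates, keeping the first copy of each
deduped <- dat[, !is_dup, with = FALSE]
```

The lookup of originals does a second pass over the columns for every duplicate found, which should be fine for a handful of duplicates but is not tuned for very wide tables.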