How to remove duplicate columns (content) in data.table R?

Question

How do I remove duplicate columns with data.table, keeping just one copy of each?

I know there are other questions about duplicate columns, but they only check for duplicate column names, not for duplicate content.

What I want is to find columns that have different names but the same content.

Regards

Look at this related answer: http://stackoverflow.com/a/37564270 — talat, Jul 06 '16 at 11:02
Possible duplicate of [Delete Redundant columns in R](http://stackoverflow.com/questions/37564066/delete-redundant-columns-in-r) — dww, Jul 06 '16 at 17:20
It's not the same because I'm speaking about a data.table way of achieving that result. — skan, Jul 19 '16 at 19:10

Hack-R · Answer 1 · 2016-07-06T11:24:56.440

3

This is a common task in feature engineering. The following code chunk was developed by me and the community on Kaggle for just this purpose:

##### Removing identical features
features_pair <- combn(names(train), 2, simplify = F) # list all column pairs
toRemove <- c() # init a vector to store duplicates
for(pair in features_pair) { # put the pairs for testing into temp objects
  f1 <- pair[1]
  f2 <- pair[2]

  if (!(f1 %in% toRemove) & !(f2 %in% toRemove)) {
    if (all(train[[f1]] == train[[f2]])) { # test for duplicates
      cat(f1, "and", f2, "are equals.\n")
      toRemove <- c(toRemove, f2) # build the list of duplicates
    }
  }
}

Then you can drop whichever copy of each duplicate you want. By default I drop the copy stored in the temporary object f2, like this:

train <- train[,!toRemove]
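
For a data.table specifically, the drop step needs a small change, because `toRemove` is a character vector of column names and not a logical vector. Below is a minimal sketch of the same idea. The toy table `train` is invented here for illustration. `identical()` is swapped in for `==` (a substitution, not part of the chunk above) so that columns containing `NA` still give a single TRUE/FALSE.

```r
library(data.table)

# Toy data, invented for illustration: b duplicates a, d duplicates c.
train <- data.table(a = c(1, 2, NA), b = c(1, 2, NA),
                    c = c("x", "y", "z"), d = c("x", "y", "z"),
                    e = c(3, 2, 1))

toRemove <- character(0)
for (pair in combn(names(train), 2, simplify = FALSE)) {
  f1 <- pair[1]
  f2 <- pair[2]
  if (!(f1 %in% toRemove) && !(f2 %in% toRemove)) {
    # identical() yields one TRUE/FALSE even when the columns contain NA
    if (identical(train[[f1]], train[[f2]])) {
      toRemove <- c(toRemove, f2)
    }
  }
}

# Drop the duplicates by name, by reference; skip if nothing was found.
if (length(toRemove) > 0) train[, (toRemove) := NULL]
print(names(train))
```

Wrapping `toRemove` in parentheses on the left of `:=` tells data.table to treat it as a vector of column names and not as a new column called `toRemove`.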

edited Jul 06 '16 at 11:24

answered Jul 06 '16 at 10:59

Hack-R


1

I would sure appreciate a comment to explain the downvote. This is one of my most useful and frequently used code chunks. – Hack-R Jul 06 '16 at 11:17
4

I didn't downvote your answer or the OP, but your code seems quite long when it could be done using `dat[!duplicated(as.list(dat))]` for a data.frame or `dat[, !duplicated(as.list(dat)), with = FALSE]` for a data.table – talat Jul 06 '16 at 11:23
2

Just curious, have you compared speed (using `data.table` ...) with the linked answer above (cf. @docendodiscimus's comment)? I would encapsulate the comparison function and use it as the `FUN` argument to `combn()`, but otherwise, nice answer. – Ben Bolker Jul 06 '16 at 11:24
1

I see a few reasons to downvote: you're growing something inside a loop, which probably doesn't hurt much in this case but is a bad example to give the OP; you write `!toRemove` which does not work, since `toRemove` is not a logical vector; and you use `F` instead of FALSE which is pretty minor (as all of these are), but not a good idea for code you're lending to someone else verbatim. Anyway, I like that the code is readable with minimal knowledge of R. – Frank Jul 06 '16 at 12:39
@docendodiscimus Why do you need to convert it to a list? – skan Jul 06 '16 at 22:00
@docendodiscimus Why not use `unique` instead of `!duplicated`, as in `temp[,unique(as.list(temp))]`? – skan Jul 08 '16 at 11:08
How can I find out not only whether a column is duplicated but also the name of the original one? – skan Jul 08 '16 at 11:56
@docendodiscimus Is it better to reassign the result to the variable your way, or to do this directly: `dat[, which(duplicated(names(dat))) := NULL]`? – skan Jul 18 '16 at 23:06
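
The `duplicated(as.list(...))` one-liner from the comments, and the follow-up question about finding the original column, can be sketched as below. This is only an illustration on a made-up table `dat`. The lookup that pairs each duplicate with its first identical column is a suggestion added here, not something proposed in the thread. `as.list()` matters because `duplicated()` on a data.frame or data.table compares rows, whereas on a list it compares elements, that is, whole columns.

```r
library(data.table)

# Made-up example table: b has the same content as a, d the same as c.
dat <- data.table(a = 1:3, b = 1:3, c = c("x", "y", "z"),
                  d = c("x", "y", "z"), e = c(0.5, 1.5, 2.5))

cols <- as.list(dat)        # one list element per column
dup  <- duplicated(cols)    # TRUE for a column identical to an earlier one

# For each duplicate, look up the name of the first identical column.
original_of <- vapply(names(dat)[dup], function(d) {
  same <- vapply(cols, identical, logical(1), cols[[d]])
  names(dat)[which(same)[1]]
}, character(1))
print(original_of)

# Keep only the first copy of each column (returns a new data.table).
deduped <- dat[, !dup, with = FALSE]
print(names(deduped))
```

To drop the columns by reference instead of creating a copy, `dat[, (names(dat)[dup]) := NULL]` should do the same job, assuming at least one duplicate was found.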
