This is a common task in feature engineering. The following code chunk, developed by me and the community on Kaggle, was written for just this purpose:
##### Removing identical features
features_pair <- combn(names(train), 2, simplify = FALSE) # list all column pairs
toRemove <- c() # init a vector to store duplicates
for (pair in features_pair) { # put the pairs for testing into temp objects
  f1 <- pair[1]
  f2 <- pair[2]
  if (!(f1 %in% toRemove) && !(f2 %in% toRemove)) { # skip columns already flagged
    # isTRUE() treats an NA result (columns containing missing values) as "not equal"
    if (isTRUE(all(train[[f1]] == train[[f2]]))) { # test for duplicates
      cat(f1, "and", f2, "are equal.\n")
      toRemove <- c(toRemove, f2) # build the list of duplicates
    }
  }
}
Then you can drop whichever copy of each duplicate pair you prefer. By default the loop flags the second column of each pair (the one held in the temporary object f2),
and you can remove the flagged columns like this:
train <- train[, !(names(train) %in% toRemove), drop = FALSE] # toRemove holds column names, so match on names
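
If the pairwise loop is too slow on a wide data set, a shorter alternative is base R's duplicated() applied to the columns as a list. This is a different technique from the loop above, not a drop-in copy of it, and the sketch below runs on a made-up data frame (toy, dup_cols and toy_dedup are names invented for this illustration). As with the loop, the first copy of each duplicated column is the one that is kept.

```r
# Hypothetical toy data: b duplicates a, d duplicates c, e is unique
toy <- data.frame(a = c(1, 2, 3),
                  b = c(1, 2, 3),
                  c = c(0, 1, 0),
                  d = c(0, 1, 0),
                  e = c(5, 5, 5))

# as.list() turns the data frame into a list of columns;
# duplicated() then flags each column that matches an earlier one
dup_cols <- names(toy)[duplicated(as.list(toy))]

# Drop the flagged columns by name, keeping a data frame even if one column remains
toy_dedup <- toy[, !(names(toy) %in% dup_cols), drop = FALSE]

print(dup_cols)
print(names(toy_dedup))
```

It is worth checking the flagged names against the output of the loop on your own data before dropping anything, since the two approaches may not treat missing values or column types in the same way.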