
I have data spanning a couple of decades that I'm reading in from yearly files and row binding. I've found that on occasion I end up with columns that have duplicate values, and I'd like to remove the duplicate columns. This has to happen over very large tables (millions of rows, hundreds of columns), so doing a pairwise check of every column against every other is likely not feasible.

Example data:

df <- data.frame(id = 1:6,
                 x = c(15, 21, 14, 21, 14, 38),
                 y = c(36, 38, 55, 11, 5, 18),
                 z = c(15, 21, 14, 21, 14, 38),
                 a = c("D", "B", "A", "F", "H", "P"))

> df
  id  x  y  z a
1  1 15 36 15 D
2  2 21 38 21 B
3  3 14 55 14 A
4  4 21 11 21 F
5  5 14  5 14 H
6  6 38 18 38 P

z is a duplicate of x, so it should be removed. Desired result:

> df2
  id  x  y a
1  1 15 36 D
2  2 21 38 B
3  3 14 55 A
4  4 21 11 F
5  5 14  5 H
6  6 38 18 P
jzadra

1 Answer


We can apply `duplicated` on the transposed dataset and use the result to subset the columns:

df[!duplicated(t(df))]
#  id  x  y a
#1  1 15 36 D
#2  2 21 38 B
#3  3 14 55 A
#4  4 21 11 F
#5  5 14  5 H
#6  6 38 18 P
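A caveat for tables of the size in the question: `t()` coerces a mixed-type data.frame to a character matrix and copies the whole table, which is both slow and the reason piped variants like `df %>% t %>% unique %>% t` mangle the column types. A lighter-weight sketch, relying on the fact that a data.frame is a list of columns and that `duplicated` compares list elements directly, avoids the transpose entirely:

```r
df <- data.frame(id = 1:6,
                 x = c(15, 21, 14, 21, 14, 38),
                 y = c(36, 38, 55, 11, 5, 18),
                 z = c(15, 21, 14, 21, 14, 38),
                 a = c("D", "B", "A", "F", "H", "P"))

# duplicated() on a list flags elements identical to an earlier one;
# applied to the list of columns, it flags z (a copy of x) and nothing else.
df2 <- df[!duplicated(as.list(df))]

names(df2)  # z is dropped; id, x, y, a remain with their original types
```

Because no coercion happens, numeric columns stay numeric and character columns stay character, and only the column vectors themselves are hashed rather than a transposed copy of the whole table.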
akrun
  • Sometimes SO makes me feel really dumb. Thank you. – jzadra Nov 16 '18 at 22:50
  • Is there a dplyr piped solution to this? Something like `df %>% t %>% unique %>% t`? When I try it, it converts numbers to factors, which makes it ugly to fix the data types. – ok1more Jun 02 '20 at 03:13