
I have data spanning a couple of decades that I'm reading in from yearly files and row binding. I've found that on occasion I end up with columns that have duplicate values, and I'd like to remove the duplicate columns. This has to happen over very large tables (millions of rows, hundreds of columns), so doing a pairwise check of every column against every other is likely not feasible.

Example data:

df <- data.frame(id = 1:6,
                 x = c(15, 21, 14, 21, 14, 38),
                 y = c(36, 38, 55, 11, 5, 18),
                 z = c(15, 21, 14, 21, 14, 38),
                 a = c("D", "B", "A", "F", "H", "P"))

> df
  id  x  y  z a
1  1 15 36 15 D
2  2 21 38 21 B
3  3 14 55 14 A
4  4 21 11 21 F
5  5 14  5 14 H
6  6 38 18 38 P

z is a duplicate of x, so it should be removed. Desired result:

> df2
  id  x  y a
1  1 15 36 D
2  2 21 38 B
3  3 14 55 A
4  4 21 11 F
5  5 14  5 H
6  6 38 18 P
jzadra

1 Answer


We can apply `duplicated` on the transposed dataset and use the result to subset the columns:

df[!duplicated(t(df))]
#  id  x  y a
#1  1 15 36 D
#2  2 21 38 B
#3  3 14 55 A
#4  4 21 11 F
#5  5 14  5 H
#6  6 38 18 P
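A caveat for tables of the size in the question: `t()` coerces a mixed-type data.frame to a character matrix and copies the whole table, which is both slow and the reason piped variants like `df %>% t %>% unique %>% t` mangle the column types. A lighter-weight sketch, relying on the fact that a data.frame is a list of columns and that `duplicated` compares list elements directly, avoids the transpose entirely:

```r
df <- data.frame(id = 1:6,
                 x = c(15, 21, 14, 21, 14, 38),
                 y = c(36, 38, 55, 11, 5, 18),
                 z = c(15, 21, 14, 21, 14, 38),
                 a = c("D", "B", "A", "F", "H", "P"))

# duplicated() on a list flags elements identical to an earlier one;
# applied to the list of columns, it flags z (a copy of x) and nothing else.
df2 <- df[!duplicated(as.list(df))]

names(df2)  # z is dropped; id, x, y, a remain with their original types
```

Because no coercion happens, numeric columns stay numeric and character columns stay character, and only the column vectors themselves are hashed rather than a transposed copy of the whole table.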
akrun
  • Sometimes SO makes me feel really dumb. Thank you. – jzadra Nov 16 '18 at 22:50
  • Is there a dplyr piped solution to this? Something like `df %>% t %>% unique %>% t`? When I try it, it converts numbers to factors, which makes it ugly to fix the data types. – ok1more Jun 02 '20 at 03:13