I have data spanning a couple decades that I'm reading in from yearly files and row binding. I've found that on occasion I end up with columns that have duplicate values, and I'd like to remove the duplicate columns. This has to happen over very large tables (millions of rows, hundreds of columns) so doing a pairwise check is likely not feasible.
Example data:
df <- data.frame(id = c(1:6), x = c(15, 21, 14, 21, 14, 38), y = c(36, 38, 55, 11, 5, 18), z = c(15, 21, 14, 21, 14, 38), a = c("D", "B", "A", "F", "H", "P"))
> df
id x y z a
1 1 15 36 15 D
2 2 21 38 21 B
3 3 14 55 14 A
4 4 21 11 21 F
5 5 14 5 14 H
6 6 38 18 38 P
z
is a duplicate of x
, so should be removed. Desired result:
> df2
id x y a
1 1 15 36 D
2 2 21 38 B
3 3 14 55 A
4 4 21 11 F
5 5 14 5 H
6 6 38 18 P