Suppose I have the following data frame:

df <- data.frame(A=c(1,2,3),B=c("a","b","c"),C=c(2,1,3),D=c(1,2,3),E=c("a","b","c"),F=c(1,2,3))

> df
  A B C D E F
1 1 a 2 1 a 1
2 2 b 1 2 b 2
3 3 c 3 3 c 3

I want to filter out the columns that are identical. I know that I can do it with

DuplCols <- df[duplicated(as.list(df))]
UniqueCols <- df[ ! duplicated(as.list(df))]

In practice my data frame has more than 500 columns. I do not know how many identical columns of each kind there are, and I do not know their names. However, each column name is unique (as in df). My desired result is (ideally) a data frame in which each row stores the names of the columns that are identical to one another. The number of columns in DesiredResult is the size of the largest group of identical columns in the original data frame; rows for smaller groups should be padded with NA:

> DesiredResult
  X1   X2   X3
1  A    D    F
2  B    E   NA
3  C   NA   NA

(By "identical columns of the same kind" I mean the following: in df, the columns A, D, F are identical to one another and form one kind, and B, E are identical to one another and form another kind.)

user29184

1 Answer

You can use unique to get the distinct columns, then test with %in% which columns match each of them and extract their column names.

tt <- lapply(unique(as.list(df)), function(x) {colnames(df)[as.list(df) %in% list(x)]})
tt
#[[1]]
#[1] "A" "D" "F"
#
#[[2]]
#[1] "B" "E"
#
#[[3]]
#[1] "C"

t(sapply(tt, "length<-", max(lengths(tt)))) # Pad with NA; the result is a character matrix
#     [,1] [,2] [,3]
#[1,] "A"  "D"  "F" 
#[2,] "B"  "E"  NA  
#[3,] "C"  NA   NA  
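
If an actual data frame with columns named X1, X2, ... is wanted (as in DesiredResult), the padded groups can be bound row-wise and renamed. The following is only a sketch under the assumptions of the example above; the helper names cols, n and padded are introduced here for illustration and are not part of the original code.

```r
df <- data.frame(A = c(1, 2, 3), B = c("a", "b", "c"), C = c(2, 1, 3),
                 D = c(1, 2, 3), E = c("a", "b", "c"), F = c(1, 2, 3))

cols <- as.list(df)

# Group the column names by identical content (same idea as tt above)
tt <- lapply(unique(cols), function(x) colnames(df)[cols %in% list(x)])

# Pad every group with NA up to the size of the largest group
n <- max(lengths(tt))
padded <- lapply(tt, function(g) { length(g) <- n; g })

# Bind the groups as rows, convert to a data frame, and name the columns X1..Xn
DesiredResult <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(DesiredResult) <- paste0("X", seq_len(n))
DesiredResult
```

Using lapply with do.call(rbind, ...) instead of t(sapply(...)) keeps the result two-dimensional even when every group has the same size, where sapply's simplification can be less predictable.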
GKi