Identify and match duplicate columns

Question

I've got a dataset with columns of values, and need to find if any columns are matching in every row. I've been able to find which columns have a duplicate using:

S1 <- c(80,80,0,0,100)
S2 <- c(90,90,0,0,100)
S3 <- c(80,80,0,0,100)
eg <- data.frame(S1,S2,S3)
duplicated(t(eg))

where S1 and S3 are identical. This returns:

S1 = FALSE
S2 = FALSE
S3 = TRUE

So I know S3 is a duplicate of either S1 or S2, but I'm unsure which. Is there any terms I can add to help me determine which columns are identical to one another? Something like this:

S1 = S3
S2 = Unique
S3 = S1

Thank you!

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Jul 04 '20 at 19:05
Thank you for the response! I have added in a (hopefully) useful sample. — kaigg, Jul 04 '20 at 19:31
What outcome do you want if there is more than one duplicate? For example, if an S4 was also a duplicate of an S1. Or do you only have the three columns? — Michelle, Jul 04 '20 at 19:55
I have many columns, and I would need to know which three (or four...) are identical as well — kaigg, Jul 04 '20 at 20:00

Count Orlok · Accepted Answer · 2020-07-04T20:29:49.063

The following function returns a list of duplicate column groups:

get_identical_columns <- function(df) {
  cols_with_duplicates <- names(which(duplicated(t(df))))
  duplicate_column_groups <- lapply(cols_with_duplicates, function(column) {
                               names(which(apply(df, 2, identical, df[, column])))
                             })
  unique(duplicate_column_groups)
}

Say you have the following data in a data frame called df:

Columns a and b are identical to each other, as are columns c, d, and e. Running the above function will give you a list of the groups of column names that are identical to each other:

get_identical_columns(df)

# [[1]]
# [1] "a" "b"

# [[2]]
# [1] "c" "d" "e"

The function first finds the names of which columns have duplicates, identified using your duplicated(t(df)) method. We then use lapply so that for each one of those column names, we find all columns identical to it (including itself). The call to unique is to remove repeated results in the situation when there are 3 or more duplicate columns.

This is exactly what I needed, thank you so much! (FYI, in case anyone else is trying it, the bottom bit of code should read 'get_identical_columns' instead of 'get_duplicate_columns'. — kaigg, Jul 04 '20 at 19:55
@kaigg I'm glad that helped, and thanks for pointing out the typo - fixed it straight away. — Count Orlok, Jul 04 '20 at 20:30

akrun · Answer 2 · 2020-07-04T19:58:46.537

0

We can use duplicated on the transpose

duplicated(t(eg))
#   S1    S2    S3 
#FALSE FALSE  TRUE

If we need the column names of duplicated columns

eg1 <- t(eg)
i1 <- duplicated(eg1)|duplicated(eg1, fromLast = TRUE)
names(i1)[i1]

Or use split

split(names(i1), i1)
#$`FALSE`
#[1] "S2"

#$`TRUE`
#[1] "S1" "S3"

edited Jul 04 '20 at 19:58

answered Jul 04 '20 at 19:37

akrun

874,273
37
540
662

al-obrien · Answer 3 · 2020-07-04T19:44:50.543

By creating a dummy dataset, one can cross the columns to compare and check to see if all the rows are the same through a purrr loop. These can then be extracted from the dataset to have a version with only the duplicate columns.

# Load library
library(tidyverse)

# Create data that has one column with all the same rows
test_data <- tibble(col1 = rnorm(100), col2 = runif(1), col3 = runif(1))
test_data <- test_data %>% mutate(col4 = col2)

# Create combinations of rows to compare
combination1 <- combn(names(test_data), 2, simplify = T)[1,]
combination2 <- combn(names(test_data), 2, simplify = T)[2,]

# Input of combinations for loop
input <- list(combination1, combination2)

# Index list where combinations identical
index <- pmap(input, ~identical(test_data[[..1]], test_data[[..2]]))

# Pull data that are identical, with their specific names
identical <- tibble(!!sym(combination1[index==T]) := test_data[[combination1[index==T]]],
                    !!sym(combination2[index==T]) := test_data[[combination2[index==T]]])

identical

This will return a dataset with the duplicated columns. Although there could be more efficient/simpler ways to do this, I believe this example can likely be expanded to your specific use case.

# A tibble: 100 x 2
    col2  col4
   <dbl> <dbl>
 1 0.901 0.901
 2 0.901 0.901
 3 0.901 0.901
 4 0.901 0.901
 5 0.901 0.901
 6 0.901 0.901
 7 0.901 0.901
 8 0.901 0.901
 9 0.901 0.901
10 0.901 0.901
# ... with 90 more rows

Identify and match duplicate columns

3 Answers3