Consolidate duplicate columns and concatenate column names

Question

I'm sure this must be a duplicate. Say I have this:

stage <- seq(1,3,1)
exp1 <- c("A","B","C")
exp2 <- c("A","B","C")
exp3 <- c(NA, "B","C")
exp4 <- c("D","B","C")
exp5 <- c("A","B","C")
exp6 <- c(NA, "B","C")

df <- data.frame(stage, exp1, exp2, exp3, exp4, exp5, exp6)

  stage exp1 exp2 exp3 exp4 exp5 exp6
1     1    A    A <NA>    D    A <NA>
2     2    B    B    B    B    B    B
3     3    C    C    C    C    C    C

I want to combine all duplicated columns and show this by concatenating the column names. I can find duplicated columns with:

df[duplicated(lapply(df, summary))]

exp2 exp5 exp6
1    A    A <NA>
2    B    B    B
3    C    C    C

But I can't figure out how to combine the duplicated column names such that I get something like this:

  stage exp1_exp2_exp5 exp3_exp6 exp4
1     1              A      <NA>    D
2     2              B         B    B
3     3              C         C    C

Perhaps (probably) this is a poor way to visualize a comparison between different "exp"s? Maybe I need to reshape to show this more clearly?

How is `exp3` duplicate of `exp5`, but not of `exp1` or `exp2`? — pogibas, Sep 06 '17 at 11:22

score 6 · Accepted Answer · answered Sep 06 '17 at 11:34

With base R, you can use match on the unique set of vectors in the data.frame to get the groupings, feed this to split, with the names of the variables as the first argument to break up the names by group, use sapplyand paste to concatenate the variable names in each group, and then use setNames to provide the names to the unique set of columns.

setNames(as.data.frame(unique(as.list(df))),
         sapply(split(names(df), match(as.list(df), unique(as.list(df)))),
                paste, collapse="-"))

which returns

  stage exp1-exp2-exp5 exp3-exp6 exp4
1     1              A      <NA>    D
2     2              B         B    B
3     3              C         C    C

Jaap · Answer 2 · 2017-09-06T11:45:46.720

4

Looking for something like this?

library(dplyr)
library(tidyr)
df %>% 
  gather(variable, value, -1) %>% 
  group_by(variable) %>% 
  summarise(values = paste(sort(value), collapse = ',')) %>% 
  group_by(values) %>% 
  summarise(cols = paste(variable, collapse = '_')) %>% 
  separate_rows(values) %>% 
  left_join(df %>% gather(variable, value, -1, na.rm = TRUE), ., by = c('value'='values')) %>% 
  select(-variable) %>% 
  distinct() %>% 
  spread(cols, value)

which gives:

  stage exp1_exp2_exp5 exp3_exp6 exp4
1     1              A      <NA>    D
2     2              B         B    B
3     3              C         C    C

edited Sep 06 '17 at 11:45

answered Sep 06 '17 at 11:26

Jaap

81,064
34
182
193

Although this is not exactly what I was after it does provide a different approach which will be very useful to me. So thank you. – Pete900 Sep 06 '17 at 11:37
Ah it is now. However, I still like the first answer that you gave because it groups "exp"s by common groups of letters. I can still get code from the edits page. – Pete900 Sep 06 '17 at 11:49
Updated. Now it gives the correct output, though @lmo's solution is a lot shorter. – Jaap Sep 06 '17 at 11:49
1

@Pete900 If you run my code step by step, you'll see that it groups *"exp"*s by common groups of letters as well ;-) – Jaap Sep 06 '17 at 11:52

Consolidate duplicate columns and concatenate column names

2 Answers2