Merge duplicated column names

Question

I have a data frame where some columns have the same data, but different column names. I would like to remove duplicated columns, but merge the column names. An example, where test1 and test4 columns are duplicates:

df

      test1 test2 test3 test4
    1     1     1     0     1
    2     2     2     2     2
    3     3     4     4     3
    4     4     4     4     4
    5     5     5     5     5
    6     6     6     6     6

and I would like the result to be something like this:

df

      test1+test4 test2 test3 
    1           1     1     0     
    2           2     2     2     
    3           3     4     4     
    4           4     4     4     
    5           5     5     5     
    6           6     6     6

Here is the data:

structure(list(test1 = c(1, 2, 3, 4, 5, 6), test2 = c(1, 2, 4, 
4, 5, 6), test3 = c(0, 2, 4, 4, 5, 6), test4 = c(1, 2, 3, 4, 
5, 6)), .Names = c("test1", "test2", "test3", "test4"), row.names = c(NA, 
-6L), class = "data.frame")

Please note that I do not simply want to remove duplicated columns. I also want to merge the column names of the duplicated columns, after the duplicates are removed.

I could do it manually for the simple table I posted, but I want to use this on large datasets, where I don't know in advance what columns are identical. I do not what to remove and rename columns manually, since I might have over 50 duplicated columns.

We have to assume that you googled "r remove duplicate columns". Please clarify why the first few hits were not helpful. Otherwise this question will be closed as a duplicate. — Henrik, Mar 27 '17 at 17:32
Yes, I have. Please look at the column names in the result table. I do not only want to remove duplicated columns. I also want to merge the column names of the duplicated columns, after the duplicates are removed. I could do it manually for the simple table I posted, but I want to use this on large datasets. — arielle, Mar 27 '17 at 17:34
Do you know in advance which columns are duplicate? Or you'd like that to be determined automatically — MichaelChirico, Mar 27 '17 at 17:34
I want it do be determined automatically. I guess something like this would work: duplicated(t(df)) — arielle, Mar 27 '17 at 17:35
or to get the column names of duplicated columns: colnames(df)[duplicated(t(df))]. The issue is that this only gets one of the member of duplicate pairs, and not both — arielle, Mar 27 '17 at 17:37
I am not sure why my post was edited @Imo, but I want to do this in R. Why did you remove "R:" from the title? — arielle, Mar 27 '17 at 18:14
You already have the R tag. That is sufficient as people who follow that tag will be able to find your post. — lmo, Mar 27 '17 at 18:32
`match` and `unique` can handle "list"s (here, "data.frame"), so `match(df, unique.default(df))` can provide a mapping for identical columns as an easy start. How large are your data? How many unique values are there? — alexis_laz, Mar 27 '17 at 18:43
My data is around 1500 columns and 15000 rows. There are about 50 duplicated columns I would say — arielle, Mar 27 '17 at 18:50

score 1 · Accepted Answer · edited May 23 '17 at 12:10

Ok, improving on the above answer using the idea from here. Save the duplicate and non-duplicate columns into data frames. Check to see if the non-duplicates match any duplicates, and if so concatenate their columns names. So this will now work if you have more than two duplicate columns.

Editted: Changed summary to digest. This helps with character data.

df <- structure(list(test1 = c(1, 2, 3, 4, 5, 6), test2 = c(1, 2, 4, 
4, 5, 6), test3 = c(0, 2, 4, 4, 5, 6), test4 = c(1, 2, 3, 4, 
5, 6)), .Names = c("test1", "test2", "test3", "test4"), row.names = c(NA, 
-6L), class = "data.frame")

library(digest)
nondups <- df[!duplicated(lapply(df, digest))]

dups <- df[duplicated(lapply(df, digest))]

for(i in 1:ncol(nondups)){
  for(j in 1:ncol(dups)){
    if(FALSE %in% paste0(nondups[,i] == dups[,j])) NULL
    else names(nondups)[i] <- paste(names(nondups[i]), names(dups[j]), sep = "+")
  }
}

nondups

Example 2, as a function.

Editted: Changed summary to digest and return non-duplicated and duplicated data frames.

age <- 18:29
height <- c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender <- c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe <- data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender, gender3 = gender)

dupcols <- function(df = testframe){
  nondups <- df[!duplicated(lapply(df, digest))]

  dups <- df[duplicated(lapply(df, digest))]

  for(i in 1:ncol(nondups)){
    for(j in 1:ncol(dups)){
      if(FALSE %in% paste0(nondups[,i] == dups[,j])) NULL
      else names(nondups)[i] <- paste(names(nondups[i]), names(dups[j]), sep = "+")
    }
  }

  return(list(df1 = nondups, df2 = dups))
}

dupcols(df = testframe)

Editted: This section is new.

Example 3: On a large data frame

#Creating a 1500 column by 15000 row data frame
dat <- do.call(data.frame, replicate(1500, rep(FALSE, 15000), simplify=FALSE))
names(dat) <- 1:1500

#Fill the data frame with LETTERS across the rows
#This part may take a while. Took my PC about 23 minutes.
start <- Sys.time()
  fill <- rep(LETTERS, times = ceiling((15000*1500)/26))
  j <- 0
  for(i in 1:nrow(dat)){
    dat[i,] <- fill[(1+j):(1500+j)]
    j <- j + 1500
  }
difftime(Sys.time(), start, "mins")

#Run the function on the created data set
#This took about 4 minutes to complete on my PC.
start <- Sys.time()
  result <- dupcols(df = dat)
difftime(Sys.time(), start, "mins")

names(result$df1)
ncol(result$df1)
ncol(result$df2)

I am guessing that this could take a while to run for very large data frames, for example 15000 by 1500? — arielle, Mar 27 '17 at 18:59
Test it. Using the example I provided, and replicating the data frame a large number of times, it still works pretty quickly. `dfnew <-do.call("data.frame", replicate(500, testframe, simplify = FALSE)); ncol(dfnew); start <- Sys.time(); result <- dupcols(df = dfnew); difftime(Sys.time(), start, "secs");` The column names get pretty unwieldy though. — Jake, Mar 27 '17 at 19:08
Ok, I tested it on a big data frame and it has been running for at least 15 min, so I must have done something wrong, I will try to figure it out. — arielle, Mar 27 '17 at 19:12
I edited my answer to improve upon the function and to provide an example on a large data frame, so it's more related to how you were hoping to use it. — Jake, Mar 28 '17 at 11:47
Thank you! In the end your previous version worked well, I just had to tweak something, but this one is interesting too. Thank you for the different versions! — arielle, Mar 29 '17 at 13:30

score 0 · Answer 2 · answered Mar 27 '17 at 17:55

It's not completely automated, but the output of the loop will identify pairs of duplicate columns. You'll then have to remove one of the duplicate columns and then re-name based on what columns were duplicates.

df <- structure(list(test1 = c(1, 2, 3, 4, 5, 6), test2 = c(1, 2, 4, 
4, 5, 6), test3 = c(0, 2, 4, 4, 5, 6), test4 = c(1, 2, 3, 4, 
5, 6)), .Names = c("test1", "test2", "test3", "test4"), row.names = c(NA, 
-6L), class = "data.frame")

for(i in 1:(ncol(df)-1)){
  for(j in 2:ncol(df)){
    if(i == j) NULL
    else if(FALSE %in% paste0(df[,i] == df[,j])) NULL
    else print(paste(i, j, sep = " + "))
  }
}

new <- df[,-4]
names(new)[1] <- paste(names(df[1]), names(df[4]), sep = "+")
new

This seems to be a good start, but it does not work if there are more than two columns with identical data, since it will look for all possible pairs... — arielle, Mar 27 '17 at 17:59
And I am really looking for a way to not have to remove and rename columns manually, since I might have over 50 duplicated columns — arielle, Mar 27 '17 at 18:00

Merge duplicated column names

2 Answers2