0

I have a data frame where some columns have the same data, but different column names. I would like to remove duplicated columns, but merge the column names. An example, where test1 and test4 columns are duplicates:

df

      test1 test2 test3 test4
    1     1     1     0     1
    2     2     2     2     2
    3     3     4     4     3
    4     4     4     4     4
    5     5     5     5     5
    6     6     6     6     6

and I would like the result to be something like this:

df

      test1+test4 test2 test3 
    1           1     1     0     
    2           2     2     2     
    3           3     4     4     
    4           4     4     4     
    5           5     5     5     
    6           6     6     6   

Here is the data:

structure(list(test1 = c(1, 2, 3, 4, 5, 6), test2 = c(1, 2, 4, 
4, 5, 6), test3 = c(0, 2, 4, 4, 5, 6), test4 = c(1, 2, 3, 4, 
5, 6)), .Names = c("test1", "test2", "test3", "test4"), row.names = c(NA, 
-6L), class = "data.frame")

Please note that I do not simply want to remove duplicated columns. I also want to merge the column names of the duplicated columns, after the duplicates are removed.

I could do it manually for the simple table I posted, but I want to use this on large datasets, where I don't know in advance what columns are identical. I do not what to remove and rename columns manually, since I might have over 50 duplicated columns.

lmo
  • 37,904
  • 9
  • 56
  • 69
arielle
  • 915
  • 1
  • 12
  • 29
  • 2
    We have to assume that you googled "r remove duplicate columns". Please clarify why the first few hits were not helpful. Otherwise this question will be closed as a duplicate. – Henrik Mar 27 '17 at 17:32
  • Yes, I have. Please look at the column names in the result table. I do not only want to remove duplicated columns. I also want to merge the column names of the duplicated columns, after the duplicates are removed. I could do it manually for the simple table I posted, but I want to use this on large datasets. – arielle Mar 27 '17 at 17:34
  • Do you know in advance which columns are duplicate? Or you'd like that to be determined automatically – MichaelChirico Mar 27 '17 at 17:34
  • I want it do be determined automatically. I guess something like this would work: duplicated(t(df)) – arielle Mar 27 '17 at 17:35
  • or to get the column names of duplicated columns: colnames(df)[duplicated(t(df))]. The issue is that this only gets one of the member of duplicate pairs, and not both – arielle Mar 27 '17 at 17:37
  • I am not sure why my post was edited @Imo, but I want to do this in R. Why did you remove "R:" from the title? – arielle Mar 27 '17 at 18:14
  • You already have the R tag. That is sufficient as people who follow that tag will be able to find your post. – lmo Mar 27 '17 at 18:32
  • Ok, sounds good – arielle Mar 27 '17 at 18:43
  • 1
    `match` and `unique` can handle "list"s (here, "data.frame"), so `match(df, unique.default(df))` can provide a mapping for identical columns as an easy start. How large are your data? How many unique values are there? – alexis_laz Mar 27 '17 at 18:43
  • My data is around 1500 columns and 15000 rows. There are about 50 duplicated columns I would say – arielle Mar 27 '17 at 18:50

2 Answers2

1

Ok, improving on the above answer using the idea from here. Save the duplicate and non-duplicate columns into data frames. Check to see if the non-duplicates match any duplicates, and if so concatenate their columns names. So this will now work if you have more than two duplicate columns.

Editted: Changed summary to digest. This helps with character data.

df <- structure(list(test1 = c(1, 2, 3, 4, 5, 6), test2 = c(1, 2, 4, 
4, 5, 6), test3 = c(0, 2, 4, 4, 5, 6), test4 = c(1, 2, 3, 4, 
5, 6)), .Names = c("test1", "test2", "test3", "test4"), row.names = c(NA, 
-6L), class = "data.frame")

library(digest)
nondups <- df[!duplicated(lapply(df, digest))]

dups <- df[duplicated(lapply(df, digest))]

for(i in 1:ncol(nondups)){
  for(j in 1:ncol(dups)){
    if(FALSE %in% paste0(nondups[,i] == dups[,j])) NULL
    else names(nondups)[i] <- paste(names(nondups[i]), names(dups[j]), sep = "+")
  }
}

nondups

Example 2, as a function.

Editted: Changed summary to digest and return non-duplicated and duplicated data frames.

age <- 18:29
height <- c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender <- c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe <- data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender, gender3 = gender)

dupcols <- function(df = testframe){
  nondups <- df[!duplicated(lapply(df, digest))]

  dups <- df[duplicated(lapply(df, digest))]

  for(i in 1:ncol(nondups)){
    for(j in 1:ncol(dups)){
      if(FALSE %in% paste0(nondups[,i] == dups[,j])) NULL
      else names(nondups)[i] <- paste(names(nondups[i]), names(dups[j]), sep = "+")
    }
  }

  return(list(df1 = nondups, df2 = dups))
}

dupcols(df = testframe)

Editted: This section is new.

Example 3: On a large data frame

#Creating a 1500 column by 15000 row data frame
dat <- do.call(data.frame, replicate(1500, rep(FALSE, 15000), simplify=FALSE))
names(dat) <- 1:1500

#Fill the data frame with LETTERS across the rows
#This part may take a while. Took my PC about 23 minutes.
start <- Sys.time()
  fill <- rep(LETTERS, times = ceiling((15000*1500)/26))
  j <- 0
  for(i in 1:nrow(dat)){
    dat[i,] <- fill[(1+j):(1500+j)]
    j <- j + 1500
  }
difftime(Sys.time(), start, "mins")

#Run the function on the created data set
#This took about 4 minutes to complete on my PC.
start <- Sys.time()
  result <- dupcols(df = dat)
difftime(Sys.time(), start, "mins")

names(result$df1)
ncol(result$df1)
ncol(result$df2)
Community
  • 1
  • 1
Jake
  • 510
  • 11
  • 19
  • I am guessing that this could take a while to run for very large data frames, for example 15000 by 1500? – arielle Mar 27 '17 at 18:59
  • Test it. Using the example I provided, and replicating the data frame a large number of times, it still works pretty quickly. `dfnew <-do.call("data.frame", replicate(500, testframe, simplify = FALSE)); ncol(dfnew); start <- Sys.time(); result <- dupcols(df = dfnew); difftime(Sys.time(), start, "secs");` The column names get pretty unwieldy though. – Jake Mar 27 '17 at 19:08
  • Ok, I tested it on a big data frame and it has been running for at least 15 min, so I must have done something wrong, I will try to figure it out. – arielle Mar 27 '17 at 19:12
  • I edited my answer to improve upon the function and to provide an example on a large data frame, so it's more related to how you were hoping to use it. – Jake Mar 28 '17 at 11:47
  • Thank you! In the end your previous version worked well, I just had to tweak something, but this one is interesting too. Thank you for the different versions! – arielle Mar 29 '17 at 13:30
0

It's not completely automated, but the output of the loop will identify pairs of duplicate columns. You'll then have to remove one of the duplicate columns and then re-name based on what columns were duplicates.

df <- structure(list(test1 = c(1, 2, 3, 4, 5, 6), test2 = c(1, 2, 4, 
4, 5, 6), test3 = c(0, 2, 4, 4, 5, 6), test4 = c(1, 2, 3, 4, 
5, 6)), .Names = c("test1", "test2", "test3", "test4"), row.names = c(NA, 
-6L), class = "data.frame")

for(i in 1:(ncol(df)-1)){
  for(j in 2:ncol(df)){
    if(i == j) NULL
    else if(FALSE %in% paste0(df[,i] == df[,j])) NULL
    else print(paste(i, j, sep = " + "))
  }
}

new <- df[,-4]
names(new)[1] <- paste(names(df[1]), names(df[4]), sep = "+")
new
Jake
  • 510
  • 11
  • 19
  • This seems to be a good start, but it does not work if there are more than two columns with identical data, since it will look for all possible pairs... – arielle Mar 27 '17 at 17:59
  • And I am really looking for a way to not have to remove and rename columns manually, since I might have over 50 duplicated columns – arielle Mar 27 '17 at 18:00