1

I have a large matrix with species as columns and sites as rows. It is a presence/absence matrix (i.e. each species can be present = 1 or absent = 0).

Some species are duplicated, but their values are not the same (i.e. the same species can be present in one record and absent in another one).

I need to merge columns, i.e. when a species is duplicated I want to keep only one record and sum all values.

For example, given this matrix:

A = matrix(c(1,0,1,1,1,1,0,0,1,1), nrow=2, ncol=5)
colnames(A)<-c("A","B","A","A","B")

   A B A A B
   1 1 1 0 1
   0 1 1 0 1

The resulting matrix that I want should be:

   A B
   2 2
   1 2

But I have 948 columns and 454 rows in my original data frame, this is just a simple example.

I have tried to transpose and then aggregate, but it is not working.

user2963185
  • 23
  • 1
  • 7
  • Check here: https://stackoverflow.com/questions/5938491/how-do-i-search-for-columns-with-same-name-add-the-column-values-and-replace-th It helped me. – Salvatore Nedia Apr 12 '20 at 08:12
  • Check here: https://stackoverflow.com/questions/5938491/how-do-i-search-for-columns-with-same-name-add-the-column-values-and-replace-th It helped me. – Salvatore Nedia Apr 12 '20 at 08:13

3 Answers3

1

it's basically a loop where we iterate through each of the unique names, then usaing grepl we extract the columns with those names and perform rowsums

sapply(unique(colnames(A)), function(x) rowSums(A[,grepl(x, colnames(A))]))
#      A B
# [1,] 2 2
# [2,] 1 2

Now coming to the problem you were facing : take the example below :

A = data.frame(c("JOEL", "WILSON"),c(1,0),c(1,1),c(1,1),c(0,0),c(1,1))
colnames(A)<-c("id","A","B","A","A","B")
#       id A B A A B
# 1   JOEL 1 1 1 0 1
# 2 WILSON 0 1 1 0 1

# assuming you have first column as factor(id) 
col <- unique(colnames(A))[-1]  # -1 means remove the first unique column name which in this case is "id" a factor column

cbind(A[1], sapply(col, function(x) rowSums(A[,grepl(x, colnames(A))])))
#      id A B
#1   JOEL 2 2
#2 WILSON 1 2

Note :I'm not subsetting the ORIGINAL data here because, when we do that, the duplicated column names now have a suffix attached to it : for e.g. :-

A1 <- A[-1]
#A1
#  A B A.1 A.2 B.1
#1 1 1   1   0   1
#2 0 1   1   0   1

Therefore, you were facing problems. I hope this helps you!

Lets go into debug mode since you still getting errors :

func <- function(x){
  w <- grepl(x, colnames(A))
  h <- A[, w]
  rowSums(h)
}  
debug(func)  
sapply(col, func)   # col is as above

Now check through the func step-by-step..

joel.wilson
  • 8,243
  • 5
  • 28
  • 48
  • Hi Joel! Thanks for this, it is looking promising, but I get this error: Error in rowSums(Diatosdef[, grepl(x, colnames(Diatosdef))]) : 'x' must be an array of at least two dimensions Called from: rowSums(Diatosdef[, grepl(x, colnames(Diatosdef))]) And here the debug: if (!is.array(x) || length(dn <- dim(x)) < 2L) stop("'x' must be an array of at least two dimensions") – user2963185 Feb 07 '17 at 21:45
  • @user2963185 did you miss to have the `,` inside `A[,...]` ? thta's important – joel.wilson Feb 08 '17 at 05:18
  • Hi Joe! I did not. My data is named "Diatosdef". Here is the code that I used. I think that it is exactly the one that you provided: sapply(unique(colnames(Diatosdef)), function(x) rowSums(Diatosdef[,grepl(x, colnames(Diatosdef))])) – user2963185 Feb 08 '17 at 06:50
  • @user2963185 that looks strange... mind sharing the `class(Diatosdef)` and `str(Diatosdef)` output.. `x` are just names and need not be of 2 dimensions... i am not sure of the cause right now... more details might be needed – joel.wilson Feb 08 '17 at 09:42
  • Hi Joel! Thanks for keeping up your help! The class is data.frame, the structure: data.frame: 454 obs. of 951 variables. First variable = Factor w/ 454 levels; rest of variables = num – user2963185 Feb 08 '17 at 10:50
  • @user2963185 if it's a `data.frame` then how is it possible to have same column names...that's just in matrices. In dataframes duplicate column names tend to have a suffix – joel.wilson Feb 08 '17 at 11:03
  • I have tried by deleting the first column and converting from data frame to matrix, but I get the same error. I did: dat<-Diatosdef[,-1] and then dat<-as.matrix(dat). When I do str(dat): num[1:454,1:948] attr(*,"dimnames")=List of 2 ..$ : NULL ..S : chr [1:948] – user2963185 Feb 08 '17 at 11:12
  • @user2963185 okay can you add the `colnames(df)` where df refers to your dataframe...just share the first 5-10 names – joel.wilson Feb 08 '17 at 11:16
  • i have edited my answer... i guess you should be able to make it up now! a workaround for you! – joel.wilson Feb 08 '17 at 11:34
  • Thanks Joel! I still get the same error :( But I REALLY appreciate your help! Thanks for taking the time to do this! – user2963185 Feb 08 '17 at 11:52
  • @user2963185 huh!! i would like to help though! did you check the colnames(df), and the values stored in `col` ? – joel.wilson Feb 08 '17 at 11:54
  • Thanks Joel! It says: Error in rowSums(h) : `x` must be an array of at least two dimensions – user2963185 Feb 08 '17 at 12:06
  • @user2963185 you don't seem to be debugging hard!! I am sharing all info i can... debug what's happening in `func` step-by-step!!!!! i know what the error is... no need to tell that again! – joel.wilson Feb 08 '17 at 12:27
  • Hi Joel! Sorry, here is everything: > sapply(col,func) debugging in: FUN(X[[i]], ...) debug en #1: { w <- grepl(x, colnames(Diatosdef)) h <- Diatosdef[, w] rowSums(h) } Browse[2]> c Error in rowSums(h) : 'x' must be an array of at least two dimensions I might be missing something. I am sorry, but as you can see I have limited experience with debugging functions. – user2963185 Feb 08 '17 at 16:54
0

I found a solution by transposing the matrix and then summing rows, following the instructions in this post: Checking duplicates, sum them and delete one row after summing

Community
  • 1
  • 1
user2963185
  • 23
  • 1
  • 7
0

Joel Wilson's code was just about there, but focused on the unique columns not the duplicate ones. My answer here is Joel's code slightly modified to instead focus on the duplicated columns.

# two functions
df2 <- sapply(unique(colnames(df)[duplicated(colnames(df))]), function(x) rowSums(df[,grepl(paste(x, "$", sep=""), colnames(df))]))
df2 <- cbind(df2, df[,!duplicated(colnames(df)) & !duplicated(colnames(df), fromLast = TRUE)])

# one long function
df2 <- cbind(sapply(unique(colnames(df)[duplicated(colnames(df))]), function(x) rowSums(df[,grepl(paste(x, "$", sep=""), colnames(df))])), df[,!duplicated(colnames(df)) & !duplicated(colnames(df), fromLast = TRUE)])

Explanation

Building up to the answer, first view column names:

colnames(df)

Create logical vector indicating columns that are duplicates:

duplicated(colnames(df))

Return column names of duplicated columns names (sounds roundabout but the logical vector is selecting the duplicated columns and then colnames returns the names):

colnames(df)[duplicated(colnames(df))]

Wrap that with the unique function to return a single copy of each of the duplicated columns:

unique(colnames(df)[duplicated(colnames(df))])

Use this code in place of Joel Wilson's column names code unique(colnames(A)):

sapply(unique(colnames(df)[duplicated(colnames(df))]), function(x) rowSums(df[,grepl(x, colnames(df))]))

What we are doing here is creating a vector with the names of the duplicated columns then applying a function iteratively across those names. For each column name, R searches the data frame and selects the columns that have that name and sums their rows.

Finally, assign this function to a new data frame and add back in the columns that were not summed (the columns that did not have duplicate names).

df2 <- sapply(unique(colnames(df)[duplicated(colnames(df))]), function(x) rowSums(df[,grepl(paste(x, "$", sep=""), colnames(df))]))
df2 <- cbind(df2, df[,!duplicated(colnames(df)) & !duplicated(colnames(df), fromLast = TRUE)])

Edit

I wasn't aware that the duplicate function marks the first occurrence of a duplicated name as FALSE. I found this post helpful in debugging my answer so that the first occurrence of duplicated columns was not included in the final dataset.

Edit 2

Using this code in practice, I found that the regular expression search summed columns that were subsets of each other. For example, if there were columns named OTU_3, OTU_35, and OTU_301, all columns named OTU_35 would be collapsed and summed, all columns named OTU_301 would be collapsed and summed, but OTU_3 would be the sum of all columns named OTU_3, OTU_35, and OTU_301. Changing the regular expression from x to paste(x, "$", sep="") resolves this issue. The dollar sign indicates that there shouldn't be any other characters after x for a successful match.