Joel Wilson's code was just about there, but focused on the unique columns not the duplicate ones. My answer here is Joel's code slightly modified to instead focus on the duplicated columns.
# two functions
df2 <- sapply(unique(colnames(df)[duplicated(colnames(df))]), function(x) rowSums(df[,grepl(paste(x, "$", sep=""), colnames(df))]))
df2 <- cbind(df2, df[,!duplicated(colnames(df)) & !duplicated(colnames(df), fromLast = TRUE)])
# one long function
df2 <- cbind(sapply(unique(colnames(df)[duplicated(colnames(df))]), function(x) rowSums(df[,grepl(paste(x, "$", sep=""), colnames(df))])), df[,!duplicated(colnames(df)) & !duplicated(colnames(df), fromLast = TRUE)])
Explanation
Building up to the answer, first view column names:
colnames(df)
Create logical vector indicating columns that are duplicates:
duplicated(colnames(df))
Return column names of duplicated columns names (sounds roundabout but the logical vector is selecting the duplicated columns and then colnames
returns the names):
colnames(df)[duplicated(colnames(df))]
Wrap that with the unique
function to return a single copy of each of the duplicated columns:
unique(colnames(df)[duplicated(colnames(df))])
Use this code in place of Joel Wilson's column names code unique(colnames(A))
:
sapply(unique(colnames(df)[duplicated(colnames(df))]), function(x) rowSums(df[,grepl(x, colnames(df))]))
What we are doing here is creating a vector with the names of the duplicated columns then applying a function iteratively across those names. For each column name, R searches the data frame and selects the columns that have that name and sums their rows.
Finally, assign this function to a new data frame and add back in the columns that were not summed (the columns that did not have duplicate names).
df2 <- sapply(unique(colnames(df)[duplicated(colnames(df))]), function(x) rowSums(df[,grepl(paste(x, "$", sep=""), colnames(df))]))
df2 <- cbind(df2, df[,!duplicated(colnames(df)) & !duplicated(colnames(df), fromLast = TRUE)])
Edit
I wasn't aware that the duplicate
function marks the first occurrence of a duplicated name as FALSE
. I found this post helpful in debugging my answer so that the first occurrence of duplicated columns was not included in the final dataset.
Edit 2
Using this code in practice, I found that the regular expression search summed columns that were subsets of each other. For example, if there were columns named OTU_3, OTU_35, and OTU_301, all columns named OTU_35 would be collapsed and summed, all columns named OTU_301 would be collapsed and summed, but OTU_3 would be the sum of all columns named OTU_3, OTU_35, and OTU_301. Changing the regular expression from x
to paste(x, "$", sep="")
resolves this issue. The dollar sign indicates that there shouldn't be any other characters after x for a successful match.