5

I have a data frame where some consecutive columns have the same name. I need to search for these, add their values in for each row, drop one column and replace the other with their sum. without previously knowing which patterns are duplicated, possibly having to compare one column name with the following to see if there's a match.

Can someone help?

Thanks in advance.

Assu
  • 71
  • 2
  • 6

4 Answers4

7
> dfrm <- data.frame(a = 1:10, b= 1:10, cc= 1:10, dd=1:10, ee=1:10)
> names(dfrm) <- c("a", "a", "b", "b", "b")
> sapply(unique(names(dfrm)[duplicated(names(dfrm))]), 
      function(x) Reduce("+", dfrm[ , grep(x, names(dfrm))]) )
       a  b
 [1,]  2  3
 [2,]  4  6
 [3,]  6  9
 [4,]  8 12
 [5,] 10 15
 [6,] 12 18
 [7,] 14 21
 [8,] 16 24
 [9,] 18 27
[10,] 20 30

EDIT 2: Using rowSums allows simplification of the first sapply argumentto just unique(names(dfrm)) at the expense of needing to remember to include drop=FALSE in "[":

sapply(unique(names(dfrm)), 
       function(x) rowSums( dfrm[ , grep(x, names(dfrm)), drop=FALSE]) )

To deal with NA's:

sapply(unique(names(dfrm)), 
      function(x) apply(dfrm[grep(x, names(dfrm))], 1, 
              function(y) if ( all(is.na(y)) ) {NA} else { sum(y, na.rm=TRUE) }
       )               )

(Edit note: addressed Tommy counter-example by putting unique around the names(.)[.] construction. The erroneous code was:

sapply(names(dfrm)[unique(duplicated(names(dfrm)))], 
     function(x) Reduce("+", dfrm[ , grep(x, names(dfrm))]) )
IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • 1
    Doesn't work on `dfrm=data.frame(a=1:10, b=20:29, a=101:110, b=200:209, a=1001:1010, check.names=F)` – Tommy May 09 '11 at 15:21
  • @Tommy. Thanks. Points out that I should have wrapped the unique around names(dfrm)[duplicated....] Fixed. – IRTFM May 09 '11 at 15:37
  • Thanks, although I suspect rowSums would be faster. – IRTFM May 09 '11 at 16:06
  • My data has missing values. How can I change this function to give me: 1) NA if all column values in a group of variables are missing and 2) a value if there is at least one value. I tried using conditionals and check NA's using is.na() but failed. Can't use na.rm=FALSE or TRUE. – Assu May 16 '11 at 09:21
4

Here is my one liner

# transpose data frame, sum by group = rowname, transpose back.
t(rowsum(t(dfrm), group = rownames(t(dfrm))))
Ramnath
  • 54,439
  • 16
  • 125
  • 152
  • +1. Very neat, though it is worth noting that as written this would need `as.data.frame()` if that is what is wanted. – Henry May 09 '11 at 18:01
  • @Henry. Yes. It would be easy to manipulate this output into whatever format the OP wants. Since it uses `rowsum` which calls a C function, it is fast, and runs 1.5x faster on the example dataset. I suspect it would be even faster on a bigger data frame. – Ramnath May 09 '11 at 18:08
2

Some sample data.

dfr <- data.frame(
  foo = rnorm(20),
  bar = 1:20,
  bar = runif(20),
  check.names = FALSE
)

Method: Loop over unique column names; if there is only one of that name, then selecting all columns with that nme will return a vector, but if there are duplicates it will also be a data frame. Use rowSums to sum over rows. (Duh. EDIT: Not quite as 'duh' as previously thought!) lapply returns a list, which we need to reform into a data frame, and finally we fix the names. EDIT: sapply avoids the need for the last step.

unique_col_names <- unique(colnames(dfr))
new_dfr <- sapply(unique_col_names, function(name)
{
  subs <- dfr[, colnames(dfr) == name]
  if(is.data.frame(subs))
    rowSums(subs)
  else
    subs
})
Richie Cotton
  • 118,240
  • 47
  • 247
  • 360
  • Regarding the 'Duh': rowSums indeed sum over rows, but rowsum sum over columns ;-) – Tommy May 09 '11 at 15:15
  • thanks for the reply. I had worked with the rowSums "Duh", that was not the main issue of my problem! Also, if I may say so, I thought that "Duh" was not a suitable "function" to post in this forum!;) – Assu May 09 '11 at 15:51
2

One way is to identify duplcates using (surprise) the duplicated function, and then loop through them to calculate the sums. Here is an example:

dat.dup <- data.frame(x=1:10, x=1:10, x=1:10, y=1:10, y=1:10, z=1:10, check.names=FALSE)
dups <- unique(names(dat.dup)[duplicated(names(dat.dup))])
for (i in dups) {
dat.dup[[i]] <- rowSums(dat.dup[names(dat.dup) == i])
}
dat <- dat.dup[!duplicated(names(dat.dup))]
Ista
  • 10,139
  • 2
  • 37
  • 38
  • thanks, it was my original idea but I failed to code it. And I was not familiarized with "unique", either. – Assu May 09 '11 at 17:11