0

I have a data frame containing multiple classes, I would like to sum those columns that have the same name and are numeric, and replace the old columns with the new sum, does anyone know a way to do this?

i.e I have a data frame like:

col1 col2  col3 col3 
char factor int int

I would like to produce

col1  col2  col3 
char factor 2int

I have previously used:

data <- as.data.frame(do.call(cbind, by(t(data),INDICES=names(data),FUN=colSums)))

However this was on a dataframe that only had numeric variables.

There are other examples on the internet but not meeting the conditions of: replacement, preserving the rest of the frame, and of being on a frame with multiple classes

Similar question: how do I search for columns with same name, add the column values and replace these columns with same name by their sum? Using R

Community
  • 1
  • 1
user124123
  • 1,642
  • 7
  • 30
  • 50

1 Answers1

1

Try

dat1 <- dat #to keep a copy of the original dataset 
indx <- sapply(dat, is.numeric)#check which columns are numeric
nm1 <- which(indx)#get the numeric index of the column
indx2 <- duplicated(names(nm1))#check which among the
# integer columns are duplicated
#use `Map` after splitting the "nm1" with its "names", do the `rowSums`
dat[ nm1[!indx2]] <- Map(function(x,y) rowSums(x[y]), list(dat),
                                       split(nm1, names(nm1)))

 dat[ -nm1[indx2]]

Update

Or to make it more efficient, only take the "duplicated" and "numeric" columns while leaving the others intact. Create an "index" (indx2) of columns that are duplicated. Subset the "nm1" based on the "indx2" and then do rowSums as described above. Finally, remove the unwanted columns (duplicated ones) by using the "indx3"

 indx2 <- duplicated(names(nm1))|duplicated(names(nm1),fromLast=TRUE)
 nm2 <- nm1[indx2]
 indx3 <- duplicated(names(nm2))
 dat[nm2[!indx3]] <- Map(function(x,y) rowSums(x[y]), 
                list(dat),split(nm2, names(nm2)))
 datN <- dat[ -nm2[indx3]]
 datN
 #    col1 col2 col3 col4 col5
 #1    16   23    2   10   10
 #2    10   18   12    8   18
 #3    21   23   15    6   10
 #4    14   37    3    5   15
 #5    29   39    5    1   11
 #6    26   31   14    2   20
 #7    25   31    2    8   10
 #8    36   31   12    8    6
 #9    32   26   13    6    4
 #10   16   38    1    7    3

Checking the results

 rowSums(dat1[names(dat1) %in% 'col1'])
 #[1] 16 10 21 14 29 26 25 36 32 16
 rowSums(dat1[names(dat1) %in% 'col2'])
 #[1] 23 18 23 37 39 31 31 31 26 38

data

dat <- structure(list(col1 = c(6L, 5L, 15L, 11L, 14L, 19L, 6L, 16L, 
17L, 6L), col2 = c(13L, 8L, 14L, 14L, 7L, 19L, 4L, 1L, 11L, 3L
), col3 = structure(c(2L, 5L, 8L, 3L, 4L, 7L, 2L, 5L, 6L, 1L), .Label = c("1", 
"2", "3", "5", "12", "13", "14", "15"), class = "factor"), col2 = c(7L, 
5L, 8L, 3L, 19L, 5L, 15L, 13L, 14L, 20L), col4 = structure(c(7L, 
6L, 4L, 3L, 1L, 2L, 6L, 6L, 4L, 5L), .Label = c("1", "2", "5", 
"6", "7", "8", "10"), class = "factor"), col5 = c(10L, 18L, 10L, 
15L, 11L, 20L, 10L, 6L, 4L, 3L), col1 = c(10L, 5L, 6L, 3L, 15L, 
7L, 19L, 20L, 15L, 10L), col2 = c(3L, 5L, 1L, 20L, 13L, 7L, 12L, 
17L, 1L, 15L)), .Names = c("col1", "col2", "col3", "col2", "col4", 
"col5", "col1", "col2"), row.names = c(NA, -10L), class = "data.frame")
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks for the solution very nice code! Do you know if it's possible to pass the reorder function to rowSums the way you have written it? – user124123 Jan 08 '15 at 13:16
  • Is it possible to modify your answer to preserve the row ordering? I made a mistake, I meant pass the 'reorder' argument to the 'rowSums' function. – user124123 Jan 08 '15 at 15:44
  • @user1987097 In the example I created, the row order is preserved. Not sure why it is not preserved – akrun Jan 08 '15 at 15:58
  • @user1987097 If you look at the code, I am not doing anything to change the row order. – akrun Jan 08 '15 at 16:08
  • The reorder argument to the rowsums function defaults to TRUE, if you don't supply it with reorder = FALSE it will do it automatically. This is problematic for datasets in which ordering is important. – user124123 Jan 08 '15 at 16:47
  • @user1987097 I couldn't find any `reorder` argument within `rowSums` From the help page, `rowSums (x, na.rm = FALSE, dims = 1)` – akrun Jan 08 '15 at 16:48
  • If that is the case, another option would be `Map(function(x,y) do.call(`+`, x[y]` (not tested) – akrun Jan 08 '15 at 16:49
  • Apologies I was looking at the "rowsum" function not "rowSums" – user124123 Jan 08 '15 at 17:50
  • @user1987097 It's okay. `rowSums` will be pretty fast. If this answers your question... – akrun Jan 08 '15 at 17:52