Sum all columns whose names start with a pattern, by group

Question

I'm fairly new to R and I'm trying to sum columns by groups based on their names. I have a data frame like this one:

DT <- data.frame(a011=c(0,10,20,0),a012=c(010,10,0,0),a013=c(10,30,0,10),
a021=c(10,20,20,10),a022=c(0,0,0,10),a023=c(20,0,0,0),a031=c(30,0,10,0),
a032=c(0,0,10,0),a033=c(20,0,0,0))

I would like to obtain the sum of all the columns starting with "a01", of all the columns starting with "a02" and all the columns starting with "a03":

a01tot a02tot a03tot
    20     30     50
    50     20      0
    20     20     20
    10     20      0

So far I have used

DT$a01tot <- rowSums(DT[,grep("a01", names(DT))])

and so on, but my real data frame has many more groups and I would like to avoid having to write a line of code for each group. I was wondering if it is possible to include "a01","a02","a03"... in a vector or list and have something that adds the columns "a01tot","a02tot","a03tot"... to the data frame automatically.

I know that my question is very similar to this one: R sum of rows for different group of columns that start with similar string, but the solution pointed out there,

cbind(df, t(rowsum(t(df), sub("_.*", "_t", names(df)))))

does not work in my case because there isn't a common element (like "_") to replace (I cannot change the names of the variables to a01_1, a02_2 etc.).

Switching to the "long" format is not a viable solution in my case either.

Any help will be greatly appreciated.

Jota · Accepted Answer · 2015-08-18T12:59:53.803

10

You can store the patterns in a vector and loop through them. With your example you can use something like this:

patterns <- unique(substr(names(DT), 1, 3))  # store patterns in a vector
new <- sapply(patterns, function(xx) rowSums(DT[,grep(xx, names(DT)), drop=FALSE]))  # loop through
#     a01 a02 a03
#[1,]  20  30  50
#[2,]  50  20   0
#[3,]  20  20  20
#[4,]  10  20   0

You can adjust the names like this:

colnames(new) <- paste0(colnames(new), "tot")  # rename

edited Aug 18 '15 at 12:59

answered Aug 17 '15 at 14:21

Jota

17,281
7
63
93

Thank you, this works just fine with my example. When I try to apply it to the real data, though, I get an error in rowSums: " 'x' must be an array of at least two dimensions". This seems to be because for one of the "groups" there is only one column, so rowSums gets a vector instead of a data frame. How could I solve this? – Astarte Aug 18 '15 at 09:55
@Astarte you can get around that issue by adding `drop = FALSE`. I edited the solution above. – Jota Aug 18 '15 at 12:59

score 2 · Answer 2 · answered Aug 17 '15 at 15:32

2

Another possible solution

library(dplyr)
library(reshape2) 
library(tidyr)

DT %>%
  mutate(id = 1:n()) %>% 
  melt(id.vars = c('id')) %>% 
  mutate(Group = substr(variable, 1, 3)) %>% 
  group_by(id, Group) %>% 
  summarise(tot = sum(value)) %>% 
  spread(Group, tot) %>% 
  select(-id)

Results

Source: local data frame [4 x 3]

  a01 a02 a03
1  20  30  50
2  50  20   0
3  20  20  20
4  10  20   0

Then as @Jota suggests colnames(new) <- paste0(colnames(new), "tot")

answered Aug 17 '15 at 15:32

giac

4,261
5
30
59

May I recommend `gather()` from `tidyr` instead of `melt`. Reshape2 is a thing of the past :) – Felipe Gerard Jun 29 '18 at 15:20
@FelipeGerard Actually I was just using `reshape2` again and it is still amazing when you are reshaping a lot of variables. It is very good at guessing names and so on. I think it's still in the game! – giac Jun 29 '18 at 22:47
1

I don't doubt it! Have you used `tidyr`'s `select`-like selectors? They're amazing! – Felipe Gerard Jul 02 '18 at 03:29

Sum all columns whose names start with a pattern, by group

2 Answers2

Linked