R sum of rows for different group of columns that start with similar string

Question

I'm quite new to R and this is the first time I dare to ask a question here.

I'm working with a dataset with likert scales and I want to row sum over different group of columns which share the first strings in their name.

Below I constructed a data frame of only 2 rows to illustrate the approach I followed, though I would like to receive feedback on how I can write a more efficient way of doing it.

df <- as.data.frame(rbind(rep(sample(1:5),4),rep(sample(1:5),4)))

var.names <- c("emp_1","emp_2","emp_3","emp_4","sat_1","sat_2"
           ,"sat_3","res_1","res_2","res_3","res_4","com_1",
           "com_2","com_3","com_4","com_5","cap_1","cap_2",
           "cap_3","cap_4")

names(df) <- var.names

So, what I did, was to use the grep function in order to be able to sum the rows of the specified variables that started with certain strings and store them in a new variable. But I have to write a new line of code for each variable.

df$emp_t <- rowSums(df[, grep("\\bemp.", names(df))])
df$sat_t <- rowSums(df[, grep("\\bsat.", names(df))])
df$res_t <- rowSums(df[, grep("\\bres.", names(df))])
df$com_t <- rowSums(df[, grep("\\bcom.", names(df))])
df$cap_t <- rowSums(df[, grep("\\bcap.", names(df))])

But there is a lot more variables in the dataset and I would like to know if there is a way to do this with only one line of code. For example, some way to group the variables that start with the same strings together and then apply the row function.

Thanks in advance!

Well, if that's an operation you need to do often, it sounds like your data is in the wrong shape. It would be easier to work with data in the "long" format rather than the "wide" format you currently have. There are plenty of other questions out there on reshaping if that's something you want to do. — MrFlick, May 21 '15 at 21:01

David Arenburg · Accepted Answer · 2015-05-25T06:22:46.823

3

One possible solution is to transpose df and calculate sums for the correct columns using base R rowsum function (using set.seed(123))

cbind(df, t(rowsum(t(df), sub("_.*", "_t", names(df)))))
#   emp_1 emp_2 emp_3 emp_4 sat_1 sat_2 sat_3 res_1 res_2 res_3 res_4 com_1 com_2 com_3 com_4 com_5 cap_1 cap_2 cap_3 cap_4 cap_t
# 1     2     4     5     3     1     2     4     5     3     1     2     4     5     3     1     2     4     5     3     1    13
# 2     1     3     4     2     5     1     3     4     2     5     1     3     4     2     5     1     3     4     2     5    14
#   com_t emp_t res_t sat_t
# 1    15    14    11     7
# 2    15    10    12     9

edited May 25 '15 at 06:22

answered May 21 '15 at 21:16

David Arenburg

91,361
17
137
196

ooohh, so close... I think I like yours better (+1) – BrodieG May 21 '15 at 21:18
@BrodieG don't know, I'm using `t` twice here, so not sure how it will scale. (+1) to you too :). Nice idea to replace with `_t` instead of just `""` btw. – David Arenburg May 21 '15 at 21:19
Thanks everyone for the responses. Until now I didn't know what regular expressions were, but since I couln't understand why the "_.*$" part worked I had to do some reading and discovered that $ is not necessary here, `cbind(df, t(rowsum(t(df), sub("_.*", "_t", names(df)))))` works just fine. – csmontt May 24 '15 at 03:57

score 2 · Answer 2 · answered May 21 '15 at 21:16

2

Agree with MrFlick that you may want to put your data in long format (see reshape2, tidyr), but to answer your question:

cbind(
  df, 
  sapply(split.default(df, sub("_.*$", "_t", names(df))), rowSums)
)

Will do the trick

answered May 21 '15 at 21:16

BrodieG

51,669
9
93
146

score 1 · Answer 3 · answered May 21 '15 at 21:27

You'll be better off in the long run if you put your data into tidy format. The problem is that the data is in a wide rather than a long format. And the variable names, e.g., emp_1, are actually two separate pieces of data: the class of the person, and the person's ID number (or something like that). Here is a solution to your problem with dplyr and tidyr.

library(dplyr)
library(tidyr)
df %>% 
  gather(key, value) %>% 
  extract(key, c("class", "id"), "([[:alnum:]]+)_([[:alnum:]]+)") %>% 
  group_by(class) %>% 
  summarize(class_sum = sum(value))

First we convert the data frame from wide to long format with gather(). Then we split the values emp_1 into separate columns class and id with extract(). Finally we group by the class and sum the values in each class. Result:

Source: local data frame [5 x 2]

  class class_sum
1   cap        26
2   com        30
3   emp        23
4   res        22
5   sat        19

You are supposed to get two values per class here. Also, how it joins back to the original data? — David Arenburg, May 21 '15 at 21:31
I see how the original question got two answers (i.e.) one for each row. But it seems like the point is then to sum them up? If not, then there is a hidden variable here. In the original df there would need to be another column specifying what identifies each row. That column would be included in the call to `group_by()`. Merging back to the original data is easier. One could use `mutate()` in place of `summarize` to add a new column with that data. Or one could `left_join()` this new data frame back to the tidied data frame. — Lincoln Mullen, May 21 '15 at 21:37

score 0 · Answer 4 · answered Nov 18 '21 at 02:04

Another potential solution is to use dplyr R rowwise function. https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-rowwise/

df %>% 
  rowwise() %>%
  mutate(emp_sum = sum(c_across(starts_with("emp"))), 
         sat_sum = sum(c_across(starts_with("sat"))),
         res_sum = sum(c_across(starts_with("res"))), 
         com_sum = sum(c_across(starts_with("com"))),
         cap_sum = sum(c_across(starts_with("cap"))))

R sum of rows for different group of columns that start with similar string

4 Answers4

Linked