Summing columns in R Data Set

Question

I have a data set that I would like to sum up all of Action_1 and Action_2 and according to USER_IDs (eventually removing duplicates in USER_ID)

User_ID    Action_1   Action_2
user_001   1          1
user_001   1          0
user_001   0          1
user_002   1          1
user_002   0          1

...

Outcome:

User_ID   Action_1   Action_2
user_001  2          2
user_002  1          2

...

Is there a way to accomplish this in R? I do apologize in advance as I do not know the terminology of this technique.

Thanks.

the "terminology" is `group_by` and `aggregate` . read about group by function in sql - it could be useful. — vagabond, Nov 18 '16 at 05:19

joel.wilson · Answer 1 · 2016-11-18T06:07:42.330

1

library(data.table)
df = fread("User_ID    Action_1   Action_2
       user_001   1          1
       user_001   1          0
       user_001   0          1
       user_002   1          1
       user_002   0          1")

#incase df was a data.frame in your code then use setDT() to make it data.table, which can help you with speed. 

df[,.(Action_1=sum(Action_1), Action_2=sum(Action_2)),"User_ID"]

#    User_ID a b
#1: user_001 2 2
#2: user_002 1 2

I f there are many columns and all are such numeric ones on which yo want to sum-up, then based on @thelatemail you

dat[, lapply(.SD,sum), by=User_ID]

If you’re using several but not all of the columns, you can combine .SD with .SDcols too!

sd.cols = c("Action_1", "Action_2")
dat[, lapply(.SD,sum), by=User_ID, .SDcols = sd.cols]

edited Nov 18 '16 at 06:07

answered Nov 18 '16 at 03:59

joel.wilson

8,243
5
28
48

2

You can simplify this substantially - `dat[, lapply(.SD,sum), by=User_ID]` for instance. – thelatemail Nov 18 '16 at 04:01
yup you can! ..to the OP : .SD refers to the subset of data after grouping by User_ID – joel.wilson Nov 18 '16 at 04:03

score 0 · Answer 2 · answered Nov 18 '16 at 04:07

This is the exact use case for the aggregate function:

aggregate(df[-1], by = df["User_ID"], FUN = sum)

The by parameter is a list of factors indicating groups in the first element that the function FUN should be applied to.

Or using the formula interface:

aggregate(. ~ User_ID, df, FUN = sum)

score 0 · Answer 3 · answered Nov 18 '16 at 04:33

Considering input as:

structure(list(User_Id = structure(c(1L, 1L, 1L, 2L, 2L), 
.Label = c("user_001", "user_002"), class = "factor"), 
Action_1 = c(1, 1, 0, 1, 0), Action_2 = c(1, 0, 1, 1, 1)), 
.Names = c("User_Id", "Action_1", "Action_2"), 
row.names = c(NA, -5L), class = "data.frame")

You can try tidyverse as:

df_test %>%
  group_by(User_Id) %>%
  summarise( Action_1 = sum(Action_1), Action_2 = sum(Action_2))

Summing columns in R Data Set

3 Answers3