Optimize code to sum variables by group in R

Question

I have the following toy data:

df <- data.frame(
    Gene= c("Gene1","Gene1","Gene2","Gene3"),
    gene_1 = c(1,9,0,6),
    gene_2 = c(12,1,0,11)
)

I want to group by gene name and sum the values of the other columns across rows that share the same gene name.

I currently use the following code, but it is too slow for my actual data, which is quite large.

library(dplyr)

df <- df %>%
    group_by(Gene) %>% 
    summarise(across(everything(), sum)) %>%
    ungroup()

Are there other, less computationally expensive ways to complete this task? Thank you.

Take a look here: https://stackoverflow.com/questions/1660124/how-to-sum-a-variable-by-group. You have a specific answer for large data sets [here](https://stackoverflow.com/a/18686783/13460602). — Maël, Jun 07 '22 at 11:57

score 2 · Answer 1 · answered Jun 07 '22 at 11:57

Try rowsum, which is specialized for summing up per group.

rowsum(df[-1], df[,1])
#      gene_1 gene_2
#Gene1     10     13
#Gene2      0      0
#Gene3      6     11
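
Note that rowsum returns the group labels as row names rather than as a column. If the result should keep a Gene column like the dplyr output in the question, a minimal base R sketch could look like the following; the names sums and res are only illustrative, and this has not been benchmarked on large data.

```r
# Toy data from the question
df <- data.frame(
    Gene = c("Gene1", "Gene1", "Gene2", "Gene3"),
    gene_1 = c(1, 9, 0, 6),
    gene_2 = c(12, 1, 0, 11)
)

# Sum every non-key column within each Gene group
sums <- rowsum(df[-1], df$Gene)

# Move the group labels from the row names back into a regular column
res <- data.frame(Gene = rownames(sums), sums, row.names = NULL)
res
#    Gene gene_1 gene_2
# 1 Gene1     10     13
# 2 Gene2      0      0
# 3 Gene3      6     11
```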

GKi
