Get the sum of all duplicate rows for a each column without hard coding the column name?

Question

Hi suppose I have a table with many columns ( in the thousands) and some rows that are duplicates. What I like to do is sum any duplicates for each row and for every columns. I'm stuck because I don't want to have to hard code or loop through each column and remerge. Is there a better way to do this? Here is an example with only 3 columns for simplicity.

dat <- read.table(text='name    etc4   etc1    etc2
A       9       0       3
A       10      10       4
A       11      9       4
B       2       7       5
C       40      6       0
C       50      6       1',header=TRUE)

# I could aggregate one column at a time 
# but is there a way to do for each columns without prior hard coding?
aggregate( etc4  ~ name, data = dat, sum)

akrun · Accepted Answer · 2021-09-14T22:28:18.187

We can specify the . which signifies all the rest of the columns other than the 'name' column in aggregate

aggregate(. ~ name, data = dat, sum)
   name etc4 etc1 etc2
1    A   30   19   11
2    B    2    7    5
3    C   90   12    1

Or if we need more fine control i.e if there are other columns as well and want to avoid, either subset the data with select or use cbind

aggregate(cbind(etc1, etc2, etc4) ~ name, data = dat, sum)
name etc1 etc2 etc4
1    A   19   11   30
2    B    7    5    2
3    C   12    1   90

If we need to store the names and reuse, subset the data and then convert to matrix

cname <- c("etc4",   "etc1"  )
aggregate(as.matrix(dat[cname]) ~ name, data = dat, sum)
  name etc4 etc1
1    A   30   19
2    B    2    7
3    C   90   12

Or this may also be done in a faster way with fsum

library(collapse)
fsum(get_vars(dat, is.numeric), g = dat$name)
  etc4 etc1 etc2
A   30   19   11
B    2    7    5
C   90   12    1

wow great thanks, just a dot ".", this is very helpful. Out of curiosity for the second example, is there anyway to store the column name I want in a variable? So for example I tried `cname = c("etc4", "etc1" ) aggregate(cname ~ name, data = dat, sum)` but that fails. — Ahdee, Sep 14 '21 at 22:26

score 2 · Answer 2 · answered Sep 14 '21 at 22:16

A tidyverse approach

dat %>% 
  group_by(name) %>% 
  summarise(across(.cols = starts_with("etc"),.fns = sum))

# A tibble: 3 x 4
  name   etc4  etc1  etc2
  <chr> <int> <int> <int>
1 A        30    19    11
2 B         2     7     5
3 C        90    12     1

Get the sum of all duplicate rows for a each column without hard coding the column name?

2 Answers2