What is the simplest way to aggregate (sum) rows by column values in the following type of data frame in R?

Question

index   type.x  type.y   col3   col4
1        a        m      20      25
2        b        m      30      28
3        a        m      15      555
3        a        n      20      555
4        a        m      666     10
4        b        m      666     20

I have tried aggregate (keeping the index) and group_by, without success, to get this shape:

index   col3   col4
1        20      25
2        30      28
3        35      555
4        666     30

Could you please precisely define what kind of aggregation you wish, because now we can only guess when you sum up the values and when you don't. — Volokh, Nov 29 '19 at 15:06

score 3 · Accepted Answer · answered Nov 29 '19 at 15:26

If you are using base R, the following code may help

r <- aggregate(df[4:5], by = df[1], FUN = function(v) sum(unique(v)))

which gives

> r
  index col3 col4
1     1   20   25
2     2   30   28
3     3   35  555
4     4  666   30
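
To reproduce this, here is a minimal sketch that rebuilds the question's table by hand and runs the same call. The construction of `df` is an assumption (the question never shows how `df` was created, and integer columns are assumed); only the `aggregate` call comes from the answer.

```r
# Assumed reconstruction of the data frame shown in the question
df <- data.frame(
  index  = c(1L, 2L, 3L, 3L, 4L, 4L),
  type.x = c("a", "b", "a", "a", "a", "b"),
  type.y = c("m", "m", "m", "n", "m", "m"),
  col3   = c(20L, 30L, 15L, 20L, 666L, 666L),
  col4   = c(25L, 28L, 555L, 555L, 10L, 20L)
)

# Columns 4:5 are col3 and col4; column 1 (index) is the grouping variable
r <- aggregate(df[4:5], by = df[1], FUN = function(v) sum(unique(v)))
print(r)
```

Selecting columns by position (`df[4:5]`, `df[1]`) depends on the column order; `df[c("col3", "col4")]` and `df["index"]` should behave the same way and are less fragile if columns are reordered.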

ThomasIsCoding

Did you mean that there is a faster or simpler way to do it with another package? Thanks a lot Thomas, that's a nice way. – StivJ Nov 29 '19 at 15:30
@OrlandoStivenJaramilloPiza I have no idea whether `base R` is more efficient than other packages, but `aggregate` is powerful enough to solve the problem, so I don't think there is any need to use functions from other packages – ThomasIsCoding Nov 29 '19 at 15:34
Alright, may I ask how the "function(v) sum(unique(v))" works? I think it is an anonymous function, but I don't quite get how it works with the "unique" function for the aggregation part. I will read any docs you have. Thanks again. – StivJ Nov 29 '19 at 15:56
@OrlandoStivenJaramilloPiza `sum(unique(v))` works like this: for each group of values, it removes the duplicates and then sums what remains – ThomasIsCoding Nov 29 '19 at 15:58
Is there any book or documentation where I can learn to write different functions in this "function(v) sum(unique(v))" way? Thanks for the 1000th time! – StivJ Nov 29 '19 at 16:35
@OrlandoStivenJaramilloPiza maybe this could do https://cran.r-project.org/index.html – ThomasIsCoding Nov 29 '19 at 18:40
I can't find that specific part – StivJ Dec 02 '19 at 12:31
@OrlandoStivenJaramilloPiza well... I cannot show you exactly where it is, but I believe you will pick it up after reading many R programming tutorials... that is how I learnt it myself – ThomasIsCoding Dec 02 '19 at 12:35
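
As a rough illustration of what the comments above describe, the sketch below applies the anonymous function to single groups taken from the question's data. The names `sum_unique` and `per_group` are made up for this example; the `split`/`sapply` part is only an approximation of what `aggregate` does for one column, not its actual implementation. R's built-in help pages (`?aggregate`, `?sapply`) cover the `FUN` argument in more detail.

```r
# The anonymous function from the answer, given a name for illustration
sum_unique <- function(v) sum(unique(v))

# col4 values for index 3 in the question: the duplicated 555 is counted once
unique(c(555, 555))
sum_unique(c(555, 555))

# col3 values for index 3: both values are distinct, so both are summed
sum_unique(c(15, 20))

# aggregate() splits each column by group and applies FUN to each piece,
# roughly like this base R equivalent for a single column:
col3  <- c(20, 30, 15, 20, 666, 666)
index <- c(1, 2, 3, 3, 4, 4)
per_group <- sapply(split(col3, index), sum_unique)
print(per_group)
```

Any function that takes a vector and returns a single value can be passed the same way, for example `function(v) max(v)` or simply `max`.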

score 2 · Answer 2 · A. Suliman · 2019-11-29T15:51:15.227

I assume you want the first element if all values in a group are identical, and otherwise the sum:

library(dplyr)
df %>% 
   group_by(index) %>% 
   # n_distinct(x) is equivalent to length(unique(x))
   # Alternatively, using @Thomas's idea: list(~sum(unique(.), na.rm = TRUE))
   summarise_at(vars(col3,col4), list(~if_else(n_distinct(.)==1, .[1], sum(., na.rm=TRUE))))

# A tibble: 4 x 3
  index  col3  col4
  <int> <int> <int>
1     1    20    25
2     2    30    28
3     3    35   555
4     4   666    30

edited Nov 29 '19 at 15:51

answered Nov 29 '19 at 14:56

A. Suliman

I think your code would break if col3 contains an additional row with e.g. index=4 and col3=1, then you will sum up the two 666s. (However, it is not clear which kind of aggregation is desired) – Volokh Nov 29 '19 at 15:05
@Volokh could you please provide this scenario using `dput`. Thanks – A. Suliman Nov 29 '19 at 15:08
Something like this: df <- structure(list(index = c(1, 2, 3, 3, 4, 4, 4), col3 = c(20, 30, 15, 20, 666, 666, 111), col4 = c(25, 28, 555, 555, 10, 20, 11 )), class = "data.frame", row.names = c(NA, -7L)) – Volokh Nov 29 '19 at 15:11
The problem is only with the duplicated index values, which exist because other columns have different values for the same index, and I can't come up with a simple way to handle them. – StivJ Nov 29 '19 at 15:26
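
To make the scenario from the comments concrete, the sketch below runs both aggregation rules on Volokh's example data using base R only. The names `df_volokh`, `sum_unique` and `first_or_sum` are hypothetical, and `first_or_sum` is a plain-R approximation of the `if_else(n_distinct(.) == 1, ...)` expression in the answer, not the dplyr code itself.

```r
# Volokh's example data from the comment above
df_volokh <- structure(
  list(index = c(1, 2, 3, 3, 4, 4, 4),
       col3  = c(20, 30, 15, 20, 666, 666, 111),
       col4  = c(25, 28, 555, 555, 10, 20, 11)),
  class = "data.frame", row.names = c(NA, -7L)
)

# Rule 1: sum the distinct values within each group
sum_unique <- function(v) sum(unique(v))

# Rule 2 (approximating the dplyr answer): first value if all are equal, else sum
first_or_sum <- function(v) {
  if (length(unique(v)) == 1) v[1] else sum(v, na.rm = TRUE)
}

by_unique <- aggregate(df_volokh[c("col3", "col4")],
                       by = df_volokh["index"], FUN = sum_unique)
by_first_or_sum <- aggregate(df_volokh[c("col3", "col4")],
                             by = df_volokh["index"], FUN = first_or_sum)

print(by_unique)        # index 4, col3: 666 + 111
print(by_first_or_sum)  # index 4, col3: 666 + 666 + 111
```

The two rules agree on the question's original data and differ only when a group mixes repeated and distinct values, which is why the desired aggregation needs to be stated explicitly.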

score 1 · Answer 3 · answered Nov 29 '19 at 16:04

We can also use

library(dplyr)
df %>% 
  group_by(index) %>%
  summarise_at(vars(starts_with('col')), ~ sum(unique(.x)))

akrun

score 0 · Answer 4 · answered Nov 29 '19 at 15:14

Making a similar assumption to A. Suliman's dplyr answer (namely, that you want to sum the unique values), I would suggest using data.table:

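library(data.table)

# Sum only the distinct values within a group
my_agg_function <- function(x) {
  sum(unique(x))
}

# The j/by syntax below needs a data.table; setDT() converts df by reference
setDT(df)
df[, .(col3 = my_agg_function(col3), col4 = my_agg_function(col4)), by = index]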
