0

I am trying to aggregate a dataset of 12.000 obs. with 37 variables in which I want to group by 2 variables and sum by 1.
All other variables must remain, as these contain important information for later analysis.
Most remaining variables contain the same value within the group, from others I would want to select the first value.
To get a better feeling of what is happening, I created a random small test dataset (10 obs. 5 variables).

row <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4)
set1 <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3)
set2 <- c(1, 1, 1, 2, 2, 2, 1, 1, 2, 1)
set3 <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
df <- data.frame(row, y, set1, set2, set3)

df
   row y set1 set2 set3
1    1 1    1    1    1
2    2 1    1    1    1
3    3 1    1    1    2
4    4 2    1    2    2
5    5 2    1    2    3
6    6 2    1    2    3
7    7 3    2    1    4
8    8 3    2    1    4
9    9 3    2    2    5
10  10 4    3    1    5

I want to aggregate the data based on set1 and set2, getting sum(y)-values, whilst keeping the other columns (here row and set3) by selecting the first value within the remaining columns, resulting in the following aggregated dataframe (or tibble):

# row y set1  set2  set3
# 1   3 1     1     1
# 4   6 1     2     2
# 7   6 2     1     4
# 9   3 2     2     5
# 10  4 3     1     5

I have checked other questions for a possible solution, but have not been able to solve mine.
The most important questions and websites I have looked into and tried are:
Combine rows and sum their values
https://community.rstudio.com/t/combine-rows-and-sum-values/41963
https://datascienceplus.com/aggregate-data-frame-r/
R: How to aggregate some columns while keeping other columns
Aggregate by multiple columns, sum one column and keep other columns? Create new column based on aggregated values?

I have figured out that using summarise in dplyr always results in removal of remaining variables.
I thought to have found a solution with R: How to aggregate some columns while keeping other columns as reproducing the example gave satisfying results.
As using

library(dplyr)
df_aggr1 <-
  df %>%
  group_by(set1, set2) %>%
  slice(which.max(y))

Resulted in

# A tibble: 5 x 5
# Groups:   set1, set2 [5]
    row     y  set1  set2  set3
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     1     1     1     1
2     4     2     1     2     2
3     7     3     2     1     4
4     9     3     2     2     5
5    10     4     3     1     5

However, using

library(dplyr)
df_aggr2 <-
  df %>%
  group_by(set1, set2) %>%
  slice(sum(y))

resulted in:

# A tibble: 1 x 5
# Groups:   set1, set2 [1]
    row     y  set1  set2  set3
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     3     1     1     1     2

In which y apparently is not even summed, so I do not get what is happening.

What am I missing?
Thanks in advance!

Slimke
  • 23
  • 4

1 Answers1

2

It works for me when literally specifying that you want the first value, i.e.:

library(tidyverse)
df %>%
  group_by(set1, set2) %>%
  summarize(y = sum(y),
            row = row[1],
            set3 = set3[1])

 A tibble: 5 x 5
# Groups:   set1 [3]
   set1  set2     y   row  set3
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     1     3     1     1
2     1     2     6     4     2
3     2     1     6     7     4
4     2     2     3     9     5
5     3     1     4    10     5

Edit: To keep every other column without specifying, you can make use of across() and indicate that you want to apply this aggregation to every column except one.

df %>%
  group_by(set1, set2) %>%
  summarize(
    across(!y, ~ .x[1]), 
    y = sum(y)
  )

# A tibble: 5 x 5
# Groups:   set1 [3]
   set1  set2   row  set3     y
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     1     1     1     3
2     1     2     4     2     6
3     2     1     7     4     6
4     2     2     9     5     3
5     3     1    10     5     4
erocoar
  • 5,723
  • 3
  • 23
  • 45
  • Thank you for your quick response! I forgot to indicate that I used ```dplyr```. I first got a wrong 4 obs. result with ```dplyr``` compared with your ```tidyverse``` solution. But now it is working (though I am not sure because I used ```tidyverse``` as well if it is really selecting ```dplyr``` now). Not sure what happened there, but ```tidyverse``` seems to work anyway, so this could be the solution. Do you know of a method to select all remaining variables in one go? Otherwise I need to add 34 lines for all remaining columns. – Slimke Feb 22 '21 at 12:41
  • Hey, `tidyverse` should use the `dplyr` library by default, unless you've imported another library with the same function names afterwards. I've edited my response to show how to select all remaining variables at once :) – erocoar Feb 22 '21 at 14:13