I am trying to aggregate a dataset of 12.000 obs. with 37 variables in which I want to group by 2 variables and sum by 1.
All other variables must remain, as these contain important information for later analysis.
Most remaining variables contain the same value within the group, from others I would want to select the first value.
To get a better feeling of what is happening, I created a random small test dataset (10 obs. 5 variables).
row <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4)
set1 <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3)
set2 <- c(1, 1, 1, 2, 2, 2, 1, 1, 2, 1)
set3 <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
df <- data.frame(row, y, set1, set2, set3)
df
row y set1 set2 set3
1 1 1 1 1 1
2 2 1 1 1 1
3 3 1 1 1 2
4 4 2 1 2 2
5 5 2 1 2 3
6 6 2 1 2 3
7 7 3 2 1 4
8 8 3 2 1 4
9 9 3 2 2 5
10 10 4 3 1 5
I want to aggregate the data based on set1 and set2, getting sum(y)-values, whilst keeping the other columns (here row and set3) by selecting the first value within the remaining columns, resulting in the following aggregated dataframe (or tibble):
# row y set1 set2 set3
# 1 3 1 1 1
# 4 6 1 2 2
# 7 6 2 1 4
# 9 3 2 2 5
# 10 4 3 1 5
I have checked other questions for a possible solution, but have not been able to solve mine.
The most important questions and websites I have looked into and tried are:
Combine rows and sum their values
https://community.rstudio.com/t/combine-rows-and-sum-values/41963
https://datascienceplus.com/aggregate-data-frame-r/
R: How to aggregate some columns while keeping other columns
Aggregate by multiple columns, sum one column and keep other columns? Create new column based on aggregated values?
I have figured out that using summarise
in dplyr
always results in removal of remaining variables.
I thought to have found a solution with R: How to aggregate some columns while keeping other columns as reproducing the example gave satisfying results.
As using
library(dplyr)
df_aggr1 <-
df %>%
group_by(set1, set2) %>%
slice(which.max(y))
Resulted in
# A tibble: 5 x 5
# Groups: set1, set2 [5]
row y set1 set2 set3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1
2 4 2 1 2 2
3 7 3 2 1 4
4 9 3 2 2 5
5 10 4 3 1 5
However, using
library(dplyr)
df_aggr2 <-
df %>%
group_by(set1, set2) %>%
slice(sum(y))
resulted in:
# A tibble: 1 x 5
# Groups: set1, set2 [1]
row y set1 set2 set3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 1 1 1 2
In which y apparently is not even summed, so I do not get what is happening.
What am I missing?
Thanks in advance!