Summarize data by group and column name

Question

I have the following data frame

library(tidyverse)    
ID <- c('A','A','B','C','D','E','F')
Level1 <- c(20,50,30,10,15,10,NA)
Level2 <- c(40,33,84,NA,20,1,NA)
Level3 <- c(60,40,60,10,25,NA,NA)
Grade1 <- c(20,50,30,10,15,10,NA)
Grade2 <- c(40,33,84,NA,20,1,NA)

DF <- data.frame(ID,Level1,Level2,Level3,Grade1,Grade2)
  ID Level1 Level2 Level3 Grade1 Grade2
1  A     20     40     60     20     40
2  A     50     33     40     50     33
3  B     30     84     60     30     84
4  C     10     NA     10     10     NA
5  D     15     20     25     15     20
6  E     10      1     NA     10      1
7  F     NA     NA     NA     NA     NA

My goal is to group the data by ID, summarize the columns with columnname containing the string "Level" by calculating the mean value. Ideally, the output should look something like this

ID        mean (Level1+Level2+Level3)
A         40.5
B         58
C         10
....

Here is my code

DF %>%
  group_by(ID) %>%
  select(starts_with('Level')) %>%
  summarise(mean(.,na.rm = TRUE))

When I run the code, I get the following output

Adding missing grouping variables: `ID`
# A tibble: 6 x 2
  ID    `mean(., na.rm = TRUE)`
  <fct>                   <dbl>
1 A                          NA
2 B                          NA
3 C                          NA
4 D                          NA
5 E                          NA
6 F                          NA
Warning messages:
1: In mean.default(., na.rm = TRUE) :
  argument is not numeric or logical: returning NA
2: In mean.default(., na.rm = TRUE) :
  argument is not numeric or logical: returning NA
3: In mean.default(., na.rm = TRUE) :
  argument is not numeric or logical: returning NA
4: In mean.default(., na.rm = TRUE) :
  argument is not numeric or logical: returning NA
5: In mean.default(., na.rm = TRUE) :
  argument is not numeric or logical: returning NA
6: In mean.default(., na.rm = TRUE) :
  argument is not numeric or logical: returning NA

Cloud you please help me understand what wrong with my code. For proposed solutions 1) columns should be selected by matching column names against a string using functions like starts_with() or contains() in dplyr. 2) I would also like to avoid pivoting or gather functions if that possible.

I appreciate your help

Possible duplicate: [*Aggregate / summarize multiple variables per group (e.g. sum, mean)*](https://stackoverflow.com/q/9723208/2204410) — Jaap, Sep 17 '19 at 05:17
Possible duplicate of [Aggregate / summarize multiple variables per group (e.g. sum, mean)](https://stackoverflow.com/questions/9723208/aggregate-summarize-multiple-variables-per-group-e-g-sum-mean) — smci, Sep 17 '19 at 05:19
Personally [I recommend defining NA-aware functions like `mean_`, `median_` etc.](https://stackoverflow.com/a/31060373/202229) — smci, Sep 17 '19 at 05:21

Sang won kim · Answer 1 · 2019-09-17T05:35:59.390

0

DF %>%
  group_by(ID) %>%
  select(starts_with('Level')) %>%
  summarise_all(funs(mean(.,na.rm = TRUE)))

or

DF %>%
  group_by(ID) %>%
  select(starts_with('Level')) %>%
  summarise_all(list(~mean(.,na.rm = TRUE)))

You can get this :

  ID    Level1 Level2 Level3
  <fct>  <dbl>  <dbl>  <dbl>
1 A         35   36.5     50
2 B         30   84       60
3 C         10  NaN       10
4 D         15   20       25
5 E         10    1      NaN
6 F        NaN  NaN      NaN

edited Sep 17 '19 at 05:35

answered Sep 17 '19 at 05:11

Sang won kim

524
5
21

@Jon Spring Isn't work summarise_all(list(~mean(.,na.rm = TRUE)))? – Sang won kim Sep 17 '19 at 05:21
It seems like to be similar to bug that have occurred in the past. I'll look for a little more. Sry. (https://github.com/tidyverse/dplyr/issues/643) – Sang won kim Sep 17 '19 at 05:39
1

Aha! I had the `raster` package loaded from a prior question, noted in the issue as a problem. Restarting fixed it. Thanks! – Jon Spring Sep 17 '19 at 05:47
@Sangwonkim I need the means to be aggregated rather than separate. Please refer to my edited question. – RDiesel Sep 17 '19 at 05:53

Jon Spring · Accepted Answer · 2019-09-17T05:28:57.233

0

Eidt: Updated answer aggregating across "Level" columns.

DF %>%
  gather(col, value, -ID) %>%
  filter(col %>% str_starts("Level")) %>%
  group_by(ID) %>%
  summarise(mean = mean(value, na.rm = TRUE))

## A tibble: 6 x 2
#  ID     mean
#  <fct> <dbl>
#1 A      40.5
#2 B      58  
#3 C      10  
#4 D      20  
#5 E       5.5
#6 F     NaN

Orig answer Here's a variant of what Sang won kim wrote that works for me with dplyr 0.8.3, the current CRAN version.

DF %>%
  group_by(ID)  %>%
  summarise_at(vars(starts_with('Level')), mean, na.rm = TRUE)

# A tibble: 6 x 4
  ID    Level1 Level2 Level3
  <fct>  <dbl>  <dbl>  <dbl>
1 A         35   36.5     50
2 B         30   84       60
3 C         10  NaN       10
4 D         15   20       25
5 E         10    1      NaN
6 F        NaN  NaN      NaN

edited Sep 17 '19 at 05:28

answered Sep 17 '19 at 05:17

Jon Spring

55,165
4
35
53

Hi Jon, I would the mean for the 3 columns combined, and separately. Please check my desired outcome in the question I posted. – RDiesel Sep 17 '19 at 05:22
Thanks Jon. Is there a solution without converting the dataframe into the long format? – RDiesel Sep 17 '19 at 05:46
When I run your modified code, I get the following error msg: Error: Evaluation error: could not find function "str_starts". – RDiesel Sep 17 '19 at 06:02
`str_starts` is from `stringr` which should be loaded with `library(tidyverse)` as in OP. https://www.tidyverse.org/packages/ – Jon Spring Sep 17 '19 at 06:04
I have imported tidyverse and stringr and still getting the error msg. – RDiesel Sep 17 '19 at 06:07
Looks like `str_starts` was added in stringr 1.4.0 in February. You could use `str_detect("Level")` alternatively, it would however detect later occurrences of the phrase, like in `RD_Level`. https://github.com/tidyverse/stringr/releases/tag/v1.4.0 – Jon Spring Sep 17 '19 at 06:18

Summarize data by group and column name

2 Answers2