34

I want to calculate the mean for several columns and thus create a new column for the mean using dplyr and without melting + merging.

> head(growth2)
  CODE_COUNTRY CODE_PLOT IV12_ha_yr IV23_ha_yr IV34_ha_yr IV14_ha_yr IV24_ha_yr IV13_ha_yr
1            1         6       4.10       6.97         NA         NA         NA       4.58
2            1        17       9.88       8.75         NA         NA         NA       8.25
3            1        30         NA         NA         NA         NA         NA         NA
4            1        37      15.43      15.07      11.89      10.00      12.09      14.33
5            1        41      20.21      15.01      14.72      11.31      13.27      17.09
6            1        46      12.64      14.36      13.65       9.07      12.47      12.36
> 

I need a new column within the dataset with the mean of all the IV columns. I tried this:

growth2 %>% 
  group_by(CODE_COUNTRY, CODE_PLOT) %>%
  summarise(IVmean=mean(IV12_ha_yr:IV13_ha_yr, na.rm=TRUE))

And returned several errors depending on the example used, such as:

Error in NA_real_:NA_real_ : NA/NaN argument

or

Error in if (trim > 0 && n) { : missing value where TRUE/FALSE needed
AndrewGB
  • 16,126
  • 5
  • 18
  • 49
fede_luppi
  • 1,063
  • 4
  • 17
  • 29

5 Answers5

41

You don't need to group, just select() and then mutate()

library(dplyr)
mutate(df, IVMean = rowMeans(select(df, starts_with("IV")), na.rm = TRUE))
Rich Scriven
  • 97,041
  • 11
  • 181
  • 245
  • 1
    This doesn't work currently (R 3.2.0 with dplyr 0.4.1). I get the following error: Error: Each argument must yield either positive or negative integers – rcorty May 17 '15 at 18:42
  • 1
    It works with `starts_with()`, but e.g. `mtcars %>% mutate(IVMean = rowMeans(select(., cyl,mpg), na.rm = TRUE))` doesn't work. It works if you take it out of the mutate. – Ruben Jan 24 '17 at 15:47
13

Here is a dplyr solution using c_across which is designed for row-wise aggregations. This makes it easy to refer to columns by name, type or position and to apply any function to the selected columns.

Note: rowwise() is a grouping operation (ie. each row is in its own group); we can reverse the grouping with an ungroup(). Thanks to @Matifou for highlighting this detail.

library("tidyverse")

df <-
  tibble::tribble(
    ~CODE_COUNTRY, ~CODE_PLOT, ~IV12_ha_yr, ~IV23_ha_yr, ~IV34_ha_yr, ~IV14_ha_yr, ~IV24_ha_yr, ~IV13_ha_yr,
    1L, 6L, 4.1, 6.97, NA, NA, NA, 4.58,
    1L, 17L, 9.88, 8.75, NA, NA, NA, 8.25,
    1L, 30L, NA, NA, NA, NA, NA, NA,
    1L, 37L, 15.43, 15.07, 11.89, 10, 12.09, 14.33,
    1L, 41L, 20.21, 15.01, 14.72, 11.31, 13.27, 17.09,
    1L, 46L, 12.64, 14.36, 13.65, 9.07, 12.47, 12.36
  )

df %>%
  rowwise() %>%
  mutate(
    mean = mean(c_across(starts_with("IV")), na.rm = TRUE),
    sd = sd(c_across(starts_with("IV")), na.rm = TRUE)
  ) %>%
  ungroup()
#> # A tibble: 6 × 10
#>   CODE_COUNTRY CODE_PLOT IV12_ha_yr IV23_ha_yr IV34_ha_yr IV14_ha_yr IV24_ha_yr
#>          <int>     <int>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#> 1            1         6       4.1        6.97       NA        NA          NA  
#> 2            1        17       9.88       8.75       NA        NA          NA  
#> 3            1        30      NA         NA          NA        NA          NA  
#> 4            1        37      15.4       15.1        11.9      10          12.1
#> 5            1        41      20.2       15.0        14.7      11.3        13.3
#> 6            1        46      12.6       14.4        13.6       9.07       12.5
#>   IV13_ha_yr   mean     sd
#>        <dbl>  <dbl>  <dbl>
#> 1       4.58   5.22  1.54 
#> 2       8.25   8.96  0.835
#> 3      NA    NaN    NA    
#> 4      14.3   13.1   2.14 
#> 5      17.1   15.3   3.09 
#> 6      12.4   12.4   1.82

Created on 2023-04-17 with reprex v2.0.2

dipetkov
  • 3,380
  • 1
  • 11
  • 19
9

Use . in dplyr.

library(dplyr)
mutate(df, IVMean = rowMeans(select(., starts_with("IV")), na.rm = TRUE))
Shixiang Wang
  • 2,147
  • 2
  • 24
  • 33
4

I tried to comment on Rick Scriven's answer but don't have the experience points for it. Anyway, wanted to contribute. His answer said to do this:

    library(dplyr)
    mutate(df, IVMean = rowMeans(select(df, starts_with("IV")), na.rm = TRUE))

That works, but if all columns don't start with "IV", which was my case, how do you do it? Turns out, that select does not want a logical vector, so you can't use AND or OR. For example, you cannot say "starts_with('X') | starts_with('Y')". You have to build a numeric vector. Here is how it is done.

    mutate(df, IVMean = rowMeans(select(df, c(starts_with("IV"), starts_with("IX"))), na.rm = TRUE))
Jeff
  • 91
  • 7
  • You can use `cbind`: e.g., `mutate(df, mean = rowMeans(cbind(my_col_1, my_col_2), na.rm=T)))` – Nova Sep 27 '21 at 12:32
0

you can use as follows:

your data

data<- structure(list(CODE_COUNTRY = c(1L, 1L, 1L, 1L, 1L, 1L), CODE_PLOT = c(6L, 
17L, 30L, 37L, 41L, 46L), IV12_ha_yr = c(4.1, 9.88, NA, 15.43, 
20.21, 12.64), IV23_ha_yr = c(6.97, 8.75, NA, 15.07, 15.01, 14.36
), IV34_ha_yr = c(NA, NA, NA, 11.89, 14.72, 13.65), IV14_ha_yr = c(NA, 
NA, NA, 10, 11.31, 9.07), IV24_ha_yr = c(NA, NA, NA, 12.09, 13.27, 
12.47), IV13_ha_yr = c(4.58, 8.25, NA, 14.33, 17.09, 12.36)), .Names = c("CODE_COUNTRY", 
"CODE_PLOT", "IV12_ha_yr", "IV23_ha_yr", "IV34_ha_yr", "IV14_ha_yr", 
"IV24_ha_yr", "IV13_ha_yr"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))

mydata <- cbind(data,IVMean=apply(data[,3:8],1,mean, na.rm=TRUE))

you can also do this

 mydata <- cbind(data,IVMean=rowMeans(data[3:8], na.rm=TRUE))
  • I guess you need to exclude the first 2 columns in the `apply` as the OP is only interested from columns `IV12_ha_yr:....` – akrun Feb 26 '15 at 14:23
  • @akrun Hi, you mean this mydata <- cbind(data[,3:8],apply(data[,3:8],1,mean)) –  Feb 26 '15 at 14:27
  • 1
    I meant `cbind(data,IVMean=apply(data[,3:8],1,mean, na.rm=TRUE))` Based on the code by the OP, it is a bit confusing what the expected output would be. – akrun Feb 26 '15 at 14:29
  • @akrun yes, this is for sure, taking into account the NA as well ! –  Feb 26 '15 at 14:34
  • 1
    Also, you can replace the `apply` step (or add as an alternative) with `rowMeans(data[3:8], na.rm=TRUE)` – akrun Feb 26 '15 at 14:35
  • @akrun for sure, we even don't need to use apply for it. I updated it –  Feb 26 '15 at 14:39