47

I can't find the answer anywhere.

I would like to calculate new variable of data frame which is based on mean of rows.

For example:

data <- data.frame(id=c(101,102,103), a=c(1,2,3), b=c(2,2,2), c=c(3,3,3))

I want to use mutate to make variable d which is mean of a,b and c. And I would like to be able to make that by selecting columns in way d=mean(a,b,c), and also I need to use range of variables (like in dplyr) d=mean(a:c).

And of course

mutate(data, c=mean(a,b)) 

or

mutate(data, c=rowMeans(a,b)) 

doesn't work.

Can you give me some tip?

Regards

Tomasz Wojtas
  • 756
  • 2
  • 6
  • 12

8 Answers8

48

You're looking for

data %>% 
    rowwise() %>% 
    mutate(c=mean(c(a,b)))

#      id     a     b     c
#   (dbl) (dbl) (dbl) (dbl)
# 1   101     1     2   1.5
# 2   102     2     2   2.0
# 3   103     3     2   2.5

or

library(purrr)
data %>% 
    rowwise() %>% 
    mutate(c=lift_vd(mean)(a,b))
Matthew Plourde
  • 43,932
  • 7
  • 96
  • 113
  • 7
    Ah. Not bad at all. Unfortunately the documentation of `rowwise` is terrible (“`rowwise` does something under some situations. Here’s an undescriptive example for a single special case that cannot be generalised.”) so I end up never using it. :-( – Konrad Rudolph Oct 30 '15 at 15:22
  • Another downside of `rowwise` is that you have to `ungroup` afterwards. A more appropriate name for the function might be `group_row` or something that makes the grouping explicit. – ha-pu Apr 19 '22 at 15:09
20

dplyr is badly suited to operate on this kind of data because it assumes tidy data format and — for the problem in question — your data is untidy.

You can of course tidy it first:

tidy_data = tidyr::gather(data, name, value, -id)

Which looks like this:

   id name value
1 101    a     1
2 102    a     2
3 103    a     3
4 101    b     2
5 102    b     2
6 103    b     2
    …

And then:

tidy_data %>% group_by(id) %>% summarize(mean = mean(value))
    name  mean
  (fctr) (dbl)
1      a     2
2      b     2
3      c     3

Of course this discards the original data. You could use mutate instead of summarize to avoid this. Finally, you can then un-tidy your data again:

tidy_data %>%
    group_by(id) %>%
    mutate(mean = mean(value)) %>%
    tidyr::spread(name, value)
     id     mean     a     b     c
  (dbl)    (dbl) (dbl) (dbl) (dbl)
1   101 2.000000     1     2     3
2   102 2.333333     2     2     3
3   103 2.666667     3     2     3

Alternatively, you could summarise and then merge the result with the original table:

tidy_data %>%
    group_by(id) %>%
    summarize(mean = mean(value)) %>%
    inner_join(data, by = 'id')

The result is the same in either case. I conceptually prefer the second variant.

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
  • Incidentally: reshaping the data can be inefficient for very big tables but I use equivalent code with a data.frame that has several million rows, and it still works fine. – Konrad Rudolph Oct 29 '15 at 18:23
  • I forgot about that for dplyr data should be tidy - but in fact, for my purposes that kind of transformation seems to be very oblique way... But now i get it! :) – Tomasz Wojtas Oct 29 '15 at 20:43
  • It would seem to me that OP's issue has more to do with `mean.default`'s awkward signature (dots after defaults) interacting with **dplyr**'s argument mapping than **dplyr**'s inability to work with 'untidy' data. – Matthew Plourde Oct 30 '15 at 15:10
12

And yet another couple of ways, useful if you have the numeric positions or vector names of the columns to be summarised:

data %>% mutate(d = rowMeans(.[, 2:4]))

or

data %>% mutate(d = rowMeans(.[, c("a","b","c")]))
JWilliman
  • 3,558
  • 32
  • 36
  • pretty similar to my answer, but I like how you don't need to use `data.frame()` that kind of always bothered me. thanks. – j_5chneider Mar 27 '18 at 20:16
  • 2
    this is the best solution, but made simpler by using select on the piped-in data frame. Have added an answer. – bjw May 28 '19 at 15:13
12

I think the answer suggesting using data.frame or slicing on . is the best, but could be made simpler and more dplyr-ish like so:

data %>% mutate(c = rowMeans(select(., a,b)))

Or if you want to avoid ., with the penalty of having two inputs to your pipeline:

data %>% mutate(c = rowMeans(select(data, a,b)))
bjw
  • 2,046
  • 18
  • 33
5

I think this is the dplyr-ish way. First, I'd create a function:

my_rowmeans = function(...) Reduce(`+`, list(...))/length(list(...))

Then, it can be used inside mutate:

data %>% mutate(rms = my_rowmeans(a, b))

#    id a b c rms
# 1 101 1 2 3 1.5
# 2 102 2 2 3 2.0
# 3 103 3 2 3 2.5

# or

data %>% mutate(rms = my_rowmeans(a, b, c))

#    id a b c      rms
# 1 101 1 2 3 2.000000
# 2 102 2 2 3 2.333333
# 3 103 3 2 3 2.666667

To deal with the possibility of NAs, the function must be uglified:

my_rowmeans = function(..., na.rm=TRUE){
  x = 
    if (na.rm) lapply(list(...), function(x) replace(x, is.na(x), as(0, class(x)))) 
    else       list(...)

  d = Reduce(function(x,y) x+!is.na(y), list(...), init=0)

  Reduce(`+`, x)/d
} 

# alternately...

my_rowmeans2 = function(..., na.rm=TRUE) rowMeans(cbind(...), na.rm=na.rm)

# new example

data$b[2] <- NA  
data %>% mutate(rms = my_rowmeans(a,b,na.rm=FALSE))

   id a  b c rms
1 101 1  2 3 1.5
2 102 2 NA 3  NA
3 103 3  2 3 2.5

data %>% mutate(rms = my_rowmeans(a,b))

   id a  b c rms
1 101 1  2 3 1.5
2 102 2 NA 3 2.0
3 103 3  2 3 2.5

The downside to the my_rowmeans2 is that it coerces to a matrix. I'm not certain that this will always be slower than the Reduce approach, though.

Frank
  • 66,179
  • 8
  • 96
  • 180
  • my_rowmeans = function(...) Reduce(`+`, list(...))/length(list(...)) this is the very close solution of my problem. But how to deal with NAs? na.rm parameter would be very useful ;-) – Tomasz Wojtas Oct 29 '15 at 20:43
  • @TomaszWojtas Updated. It would be better if your initial post reflected this as well (rather than extending the question in comments). – Frank Oct 29 '15 at 21:02
5

Another simple possibility with few code is:

data %>%
    mutate(c= rowMeans(data.frame(a,b)))

 #     id a b   c
 #  1 101 1 2 1.5
 #  2 102 2 2 2.0
 #  3 103 3 2 2.5

As rowMeans needs something like a matrix or a data.frame, you can use data.frame(var1, var2, ...) instead of c(var1, var2, ...). If you have NAs in your data you'll need to tell R what to do, for example to remove them: rowMeans(data.frame(a,b), na.rm=TRUE)

j_5chneider
  • 390
  • 5
  • 15
0

If you'd like to use a pivot_longer()-style solution:

data%>%
pivot_longer(cols=-id)%>%
group_by(id)%>%
mutate(mean=mean(value))%>%
pivot_wider(names_from=name, values_from=value)

Note that this requires the tidyr package.

This is my preference for the fact that I only need to type the name of my ID column, and don't have to worry about column indices or names otherwise. Good for a quick copy-and-point-this-at-different-data solution, though the same can be said of other answers here. Also good for cases where you might have more than one column with categorical information and haven't created a single unique identifier column.

For what it's worth, I found that this solution is very easily modified to ignore NA values with simple addition of na.rm=TRUE in the mean calculation.

For example:

data <- data.frame(id=c(101,102,103), a=c(NA,2,3), b=c(2,2,2), c=c(3,3,3))


data%>%
pivot_longer(cols=-id)%>%
group_by(id)%>%
mutate(mean=mean(value,na.rm=TRUE))%>%
pivot_wider(names_from = name, values_from=value)
Pake
  • 968
  • 9
  • 24
0

You can use a wrapper function around rowMeans() to make it easier to work with. The one below lets you specify na.rm, and you can use tidyselect to choose your columns if you want.

# This is the wrapper function
means <- function(..., na.rm = FALSE) {
  rowMeans(data.frame(...), na.rm = na.rm)
}

library(dplyr)

# Example data
iris2 <- iris %>% 
  head() %>% 
  transmute(Sepal.Length = replace(Sepal.Length,
                                   sample(c(TRUE, FALSE), nrow(.),
                                          replace = TRUE),
                                   NA),
            Sepal.Width,
            Petal.Length,
            Petal.Width) %>%
  print()
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1           NA         3.5          1.4         0.2
#> 2           NA         3.0          1.4         0.2
#> 3           NA         3.2          1.3         0.2
#> 4          4.6         3.1          1.5         0.2
#> 5           NA         3.6          1.4         0.2
#> 6          5.4         3.9          1.7         0.4


# Basic usage
iris2 %>% 
  mutate(mean_sepal = means(Sepal.Length, Sepal.Width))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width mean_sepal
#> 1           NA         3.5          1.4         0.2         NA
#> 2           NA         3.0          1.4         0.2         NA
#> 3           NA         3.2          1.3         0.2         NA
#> 4          4.6         3.1          1.5         0.2       3.85
#> 5           NA         3.6          1.4         0.2         NA
#> 6          5.4         3.9          1.7         0.4       4.65


# If you want to exclude NAs
iris2 %>% 
  mutate(mean_sepal = means(Sepal.Length, Sepal.Width, na.rm = TRUE))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width mean_sepal
#> 1           NA         3.5          1.4         0.2       3.50
#> 2           NA         3.0          1.4         0.2       3.00
#> 3           NA         3.2          1.3         0.2       3.20
#> 4          4.6         3.1          1.5         0.2       3.85
#> 5           NA         3.6          1.4         0.2       3.60
#> 6          5.4         3.9          1.7         0.4       4.65


# You can also use select() and choose columns using tidyselect
iris2 %>% 
  mutate(mean_sepal = means(select(., contains("Sepal")), na.rm = TRUE))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width mean_sepal
#> 1           NA         3.5          1.4         0.2       3.50
#> 2           NA         3.0          1.4         0.2       3.00
#> 3           NA         3.2          1.3         0.2       3.20
#> 4          4.6         3.1          1.5         0.2       3.85
#> 5           NA         3.6          1.4         0.2       3.60
#> 6          5.4         3.9          1.7         0.4       4.65

Created on 2022-01-13 by the reprex package (v2.0.1)

Oliver
  • 1,098
  • 1
  • 11
  • 16