23

I have a data.frame such as this (the real data set has many more rows and columns)

set.seed(15)
dd <- data.frame(id=letters[1:4], matrix(runif(5*4), nrow=4))

#   id        X1        X2        X3        X4        X5
# 1  a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437
# 2  b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670
# 3  c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871
# 4  d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125

I would like to be able to write a dplyr statement where I can select a subset of columns and mutate them. (I'm trying to do something similar to using .SDcols in data.table).

For a simplified example, here's the function I would like to be able to write to add columns for the sums and means of the even "X" columns while preserving all other columns. The desired output using base R is

(cols<-paste0("X", c(2,4)))
# [1] "X2" "X4"
cbind(dd,evensum=rowSums(dd[,cols]),evenmean=rowMeans(dd[,cols]))

#   id        X1        X2        X3        X4        X5   evensum  evenmean
# 1  a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.4380811
# 2  b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.8477439
# 3  c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.8387535
# 4  d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.5478768

but I wanted to use a dplyr-like chain to do the same thing. In the general case, I'd like to be able to use any of select()'s helper functions such as starts_with, ends_with, matches, etc and any function. Here's what I tried

library(dplyr)
partial_mutate1 <- function(x, colspec, ...) {
    select_(x, .dots=list(lazyeval::lazy(colspec))) %>% 
    transmute_(.dots=lazyeval::lazy_dots(...)) %>% 
    cbind(x,.)
}

dd %>% partial_mutate1(num_range("X", c(2,4)), 
    evensum=rowSums(.), evenmean=rowMeans(.))

However, This throws an error that says

Error in rowSums(.) : 'x' must be numeric

Which appears to be because . seems to be referring to the entire date.frame rather than the selected subset. (same error as rowSums(dd)). However, note that this produces the desired output

partial_mutate2 <- function(x, colspec) {
    select_(x, .dots=list(lazyeval::lazy(colspec))) %>% 
    transmute(evensum=rowSums(.), evenmean=rowMeans(.)) %>% 
    cbind(x,.)
}
dd %>% partial_mutate2(seq(2,ncol(dd),2))

I'm guessing this is some sort of environment problem? Any suggestions on how to pass the arguments to partial_mutate1 so that the . will correctly take values from the "select()-ed" dataset?

MrFlick
  • 195,160
  • 17
  • 277
  • 295
  • An ugly way would be: `dd %>% select(X2, X4) %>% mutate(evensum = rowSums(.), evenmean = rowMeans(.)) %>% select(-X2, -X4) %>% cbind(., dd)` – Steven Beaupré Feb 26 '15 at 19:33
  • I suspect the problem is that attempting the SE is messing with `%>%`. In other words, with `rowMeans(.)` burried inside `.dots`, `%>%` has no way of knowing it should be substituting the data there as well. This is just a guess. – BrodieG Feb 26 '15 at 20:12
  • I think you're right @BrodieG. After a bit more digging this is more of a magrittr issue than a dplyr issue. For example: `mutate(dd[,-1], sums=rowSums(.))` doesn't work ("object '.' not found"). So the `.` symbol isn't special to `dplyr`. Trying to use it to apply a function across columns seems to be the wrong idea. I guess I should be reshaping the data to a "tidy" format first. – MrFlick Feb 26 '15 at 20:15
  • Although SO shows the link on the right side, I thought it's better visible here: [This question](https://stackoverflow.com/questions/33314971/mutate-and-rowsums-exclude-columns?noredirect=1&lq=1) helps how to use `starts_with` and other select helper functions - the syntax which was suggested by @Brandon Bertelsen now seems to work, i.e. `mutate(new_col = rowSums(select(., starts_with(string))))` – tjebo Sep 03 '18 at 14:09

6 Answers6

13

Am I missing something or would this work as expected:

cols <- paste0("X", c(2,4))
dd %>% mutate(evensum = rowSums(.[cols]), evenmean = rowMeans(.[cols]))
#  id        X1        X2        X3        X4        X5   evensum  evenmean
#1  a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.4380811
#2  b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.8477439
#3  c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.8387535
#4  d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.5478768

Or are you specifically looking for a custom function to do this?


Not exactly what you are looking for but if you want to do it inside a pipe you could use select explicitly inside mutate like this:

dd %>% mutate(xy = select(., num_range("X", c(2,4))) %>% rowSums)
#  id        X1        X2        X3        X4        X5        xy
#1  a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623
#2  b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878
#3  c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071
#4  d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535

However, it is a bit more complicated if you want to apply several functions. You could use a helper function along the lines of (..not thoroughly tested.. ):

f <- function(x, ...) {
  n <- nrow(x)
  x <- lapply(list(...), function(y) if (length(y) == 1L) rep(y, n) else y)
  matrix(unlist(x), nrow = n, byrow = FALSE)
}

And then apply it like this:

dd %>% mutate(xy = select(., num_range("X", c(2,4))) %>% f(., rowSums(.), max(.)))
#  id        X1        X2        X3        X4        X5      xy.1      xy.2
#1  a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.9888592
#2  b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.9888592
#3  c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.9888592
#4  d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.9888592
talat
  • 68,970
  • 21
  • 126
  • 157
  • The part that's missing is that I would like to be able to use the column selection functions that `?select` provides such as starts_with, ends_with, etc. Those will not work "nicely" outside of `select()`. – MrFlick Feb 26 '15 at 20:02
  • 1
    Of course I could use `cols <- dplyr:::num_range(names(dd), "X", c(2,4))`. The thing is it wouldn't be a part of the chain and it would also break if any column orders are changed in the chain after the column indexes are calculated. It would be better to do it "on demand" if possible – MrFlick Feb 26 '15 at 20:10
  • @MrFlick, I see what you mean. Just got it wrong at first. – talat Feb 26 '15 at 20:12
2

A number-of-columns agnostic approach using dplyr:

dd %>% 
  select(-id) %>% 
  mutate(evensum = rowSums(.[,1:length(.[1,])%%2==0]), 
         evenmean = rowMeans(.[,1:length(.[1,])%%2==0])) %>% 
  cbind(id=dd[,1],.)

  id        X1        X2        X3        X4        X5   evensum  evenmean
1  a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.4380812
2  b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.8477439
3  c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.8387535
4  d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.5478767
Andrew Taylor
  • 3,438
  • 1
  • 26
  • 47
1

tidyr::nest() understands the same selector syntax as dplyr::select(), so one approach would be to consolidate the columns of interest into a single column-of-dataframes, perform the necessary operations on that column-of-dataframes, and unnest to get back a flat data frame:

library( tidyverse )
dd %>% nest( X2, X4, .key="Slice" ) %>%
    mutate( evensum = map(Slice, rowSums),
           evenmean = map(Slice, rowMeans),
           evensd = map(Slice, pmap_dbl, lift_vd(sd)) ) %>%
    unnest
#   id       X1    X3    X5 evensum evenmean evensd    X2    X4
# 1 a     0.602 0.687 0.447   0.876    0.438 0.100  0.367 0.509
# 2 b     0.195 0.831 0.965   1.70     0.848 0.200  0.989 0.707
# 3 c     0.966 0.105 0.141   1.68     0.839 0.0333 0.815 0.862
# 4 d     0.651 0.646 0.777   1.10     0.548 0.416  0.254 0.842

Since data frames are basically lists, this approach is naturally suited for applying arbitrary functions (such as sd above) to arbitrary an set of columns using purrr::pmap() family of functions.

Side note: Since sd works on vectors, we use purrr::lift_vd to convert its interface to be suitable for pmap:

sd( c(0.367, 0.509) )        # 0.100
lift_vd(sd)( 0.367, .509 )   # 0.100
Artem Sokolov
  • 13,196
  • 4
  • 43
  • 74
0

Another option is to use rowwise() plus c_across(). This type of operation won't work with rowSums or rowMeans but will work with the regular sum() and mean() functions. The c_across() function returns multiple columns as a simple vector. It also accepts any of the tidyselect helper functions. So for example you can do

dd %>% 
  rowwise() %>% 
  mutate(
    evensum = sum( c_across(all_of(cols)) ), 
    evenmean = mean( c_across(all_of(cols)) )
)
MrFlick
  • 195,160
  • 17
  • 277
  • 295
  • You can make it work without `rowwise` (which is very slow) `dd %>% mutate( evensum = rowSums(across(all_of(cols))), evenmean = rowMeans(across(all_of(cols))) )` – Julien Jun 28 '23 at 15:56
0

Use across with all_of (or any_of)

dd %>% 
  mutate(
    evensum = rowSums(across(all_of(cols))), 
    evenmean = rowMeans(across(all_of(cols)))
)
Julien
  • 1,613
  • 1
  • 10
  • 26
-2

In newer versions of dplyr, you can use the new mutate_at()

function

mutate_at(dd, vars(starts_with("X")), somefunction)
hnagaty
  • 796
  • 5
  • 13
  • This does not solve the problem posed in the question. `mutate_at` still only operates on one column at a time. There is no way to to use `rowSums` for example with `mutate_at`. – MrFlick Apr 04 '19 at 15:36