Performing dplyr mutate on subset of columns

Question

I have a data.frame such as this (the real data set has many more rows and columns)

set.seed(15)
dd <- data.frame(id=letters[1:4], matrix(runif(5*4), nrow=4))

#   id        X1        X2        X3        X4        X5
# 1  a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437
# 2  b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670
# 3  c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871
# 4  d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125

I would like to be able to write a dplyr statement where I can select a subset of columns and mutate them. (I'm trying to do something similar to using .SDcols in data.table).

For a simplified example, here's the function I would like to be able to write to add columns for the sums and means of the even "X" columns while preserving all other columns. The desired output using base R is

(cols<-paste0("X", c(2,4)))
# [1] "X2" "X4"
cbind(dd,evensum=rowSums(dd[,cols]),evenmean=rowMeans(dd[,cols]))

#   id        X1        X2        X3        X4        X5   evensum  evenmean
# 1  a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.4380811
# 2  b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.8477439
# 3  c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.8387535
# 4  d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.5478768

but I wanted to use a dplyr-like chain to do the same thing. In the general case, I'd like to be able to use any of select()'s helper functions such as starts_with, ends_with, matches, etc and any function. Here's what I tried

library(dplyr)
partial_mutate1 <- function(x, colspec, ...) {
    select_(x, .dots=list(lazyeval::lazy(colspec))) %>% 
    transmute_(.dots=lazyeval::lazy_dots(...)) %>% 
    cbind(x,.)
}

dd %>% partial_mutate1(num_range("X", c(2,4)), 
    evensum=rowSums(.), evenmean=rowMeans(.))

However, This throws an error that says

Error in rowSums(.) : 'x' must be numeric

Which appears to be because . seems to be referring to the entire date.frame rather than the selected subset. (same error as rowSums(dd)). However, note that this produces the desired output

partial_mutate2 <- function(x, colspec) {
    select_(x, .dots=list(lazyeval::lazy(colspec))) %>% 
    transmute(evensum=rowSums(.), evenmean=rowMeans(.)) %>% 
    cbind(x,.)
}
dd %>% partial_mutate2(seq(2,ncol(dd),2))

I'm guessing this is some sort of environment problem? Any suggestions on how to pass the arguments to partial_mutate1 so that the . will correctly take values from the "select()-ed" dataset?

An ugly way would be: `dd %>% select(X2, X4) %>% mutate(evensum = rowSums(.), evenmean = rowMeans(.)) %>% select(-X2, -X4) %>% cbind(., dd)` — Steven Beaupré, Feb 26 '15 at 19:33
I suspect the problem is that attempting the SE is messing with `%>%`. In other words, with `rowMeans(.)` burried inside `.dots`, `%>%` has no way of knowing it should be substituting the data there as well. This is just a guess. — BrodieG, Feb 26 '15 at 20:12
I think you're right @BrodieG. After a bit more digging this is more of a magrittr issue than a dplyr issue. For example: `mutate(dd[,-1], sums=rowSums(.))` doesn't work ("object '.' not found"). So the `.` symbol isn't special to `dplyr`. Trying to use it to apply a function across columns seems to be the wrong idea. I guess I should be reshaping the data to a "tidy" format first. — MrFlick, Feb 26 '15 at 20:15
Although SO shows the link on the right side, I thought it's better visible here: [This question](https://stackoverflow.com/questions/33314971/mutate-and-rowsums-exclude-columns?noredirect=1&lq=1) helps how to use `starts_with` and other select helper functions - the syntax which was suggested by @Brandon Bertelsen now seems to work, i.e. `mutate(new_col = rowSums(select(., starts_with(string))))` — tjebo, Sep 03 '18 at 14:09

talat · Accepted Answer · 2015-02-26T21:46:31.077

Am I missing something or would this work as expected:

cols <- paste0("X", c(2,4))
dd %>% mutate(evensum = rowSums(.[cols]), evenmean = rowMeans(.[cols]))
#  id        X1        X2        X3        X4        X5   evensum  evenmean
#1  a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.4380811
#2  b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.8477439
#3  c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.8387535
#4  d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.5478768

Or are you specifically looking for a custom function to do this?

Not exactly what you are looking for but if you want to do it inside a pipe you could use select explicitly inside mutate like this:

dd %>% mutate(xy = select(., num_range("X", c(2,4))) %>% rowSums)
#  id        X1        X2        X3        X4        X5        xy
#1  a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623
#2  b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878
#3  c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071
#4  d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535

However, it is a bit more complicated if you want to apply several functions. You could use a helper function along the lines of (..not thoroughly tested.. ):

f <- function(x, ...) {
  n <- nrow(x)
  x <- lapply(list(...), function(y) if (length(y) == 1L) rep(y, n) else y)
  matrix(unlist(x), nrow = n, byrow = FALSE)
}

And then apply it like this:

dd %>% mutate(xy = select(., num_range("X", c(2,4))) %>% f(., rowSums(.), max(.)))
#  id        X1        X2        X3        X4        X5      xy.1      xy.2
#1  a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.9888592
#2  b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.9888592
#3  c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.9888592
#4  d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.9888592

The part that's missing is that I would like to be able to use the column selection functions that `?select` provides such as starts_with, ends_with, etc. Those will not work "nicely" outside of `select()`. — MrFlick, Feb 26 '15 at 20:02
Of course I could use `cols <- dplyr:::num_range(names(dd), "X", c(2,4))`. The thing is it wouldn't be a part of the chain and it would also break if any column orders are changed in the chain after the column indexes are calculated. It would be better to do it "on demand" if possible — MrFlick, Feb 26 '15 at 20:10

Andrew Taylor · Answer 2 · 2015-02-26T20:16:03.063

A number-of-columns agnostic approach using dplyr:

dd %>% 
  select(-id) %>% 
  mutate(evensum = rowSums(.[,1:length(.[1,])%%2==0]), 
         evenmean = rowMeans(.[,1:length(.[1,])%%2==0])) %>% 
  cbind(id=dd[,1],.)

  id        X1        X2        X3        X4        X5   evensum  evenmean
1  a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.4380812
2  b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.8477439
3  c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.8387535
4  d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.5478767

score 1 · Answer 3 · answered Jan 10 '19 at 23:59

tidyr::nest() understands the same selector syntax as dplyr::select(), so one approach would be to consolidate the columns of interest into a single column-of-dataframes, perform the necessary operations on that column-of-dataframes, and unnest to get back a flat data frame:

library( tidyverse )
dd %>% nest( X2, X4, .key="Slice" ) %>%
    mutate( evensum = map(Slice, rowSums),
           evenmean = map(Slice, rowMeans),
           evensd = map(Slice, pmap_dbl, lift_vd(sd)) ) %>%
    unnest
#   id       X1    X3    X5 evensum evenmean evensd    X2    X4
# 1 a     0.602 0.687 0.447   0.876    0.438 0.100  0.367 0.509
# 2 b     0.195 0.831 0.965   1.70     0.848 0.200  0.989 0.707
# 3 c     0.966 0.105 0.141   1.68     0.839 0.0333 0.815 0.862
# 4 d     0.651 0.646 0.777   1.10     0.548 0.416  0.254 0.842

Since data frames are basically lists, this approach is naturally suited for applying arbitrary functions (such as sd above) to arbitrary an set of columns using purrr::pmap() family of functions.

Side note: Since sd works on vectors, we use purrr::lift_vd to convert its interface to be suitable for pmap:

sd( c(0.367, 0.509) )        # 0.100
lift_vd(sd)( 0.367, .509 )   # 0.100

score 0 · Answer 4 · answered Apr 08 '22 at 04:54

0

Another option is to use rowwise() plus c_across(). This type of operation won't work with rowSums or rowMeans but will work with the regular sum() and mean() functions. The c_across() function returns multiple columns as a simple vector. It also accepts any of the tidyselect helper functions. So for example you can do

dd %>% 
  rowwise() %>% 
  mutate(
    evensum = sum( c_across(all_of(cols)) ), 
    evenmean = mean( c_across(all_of(cols)) )
)

answered Apr 08 '22 at 04:54

MrFlick

195,160
17
277
295

You can make it work without `rowwise` (which is very slow) `dd %>% mutate( evensum = rowSums(across(all_of(cols))), evenmean = rowMeans(across(all_of(cols))) )` – Julien Jun 28 '23 at 15:56

score 0 · Answer 5 · answered Jun 28 '23 at 15:49

0

Use across with all_of (or any_of)

dd %>% 
  mutate(
    evensum = rowSums(across(all_of(cols))), 
    evenmean = rowMeans(across(all_of(cols)))
)

answered Jun 28 '23 at 15:49

Julien

1,613
1
10
26

score -2 · Answer 6 · answered Apr 04 '19 at 09:27

-2

In newer versions of dplyr, you can use the new mutate_at()

function

mutate_at(dd, vars(starts_with("X")), somefunction)

answered Apr 04 '19 at 09:27

hnagaty

796
5
13

This does not solve the problem posed in the question. `mutate_at` still only operates on one column at a time. There is no way to to use `rowSums` for example with `mutate_at`. – MrFlick Apr 04 '19 at 15:36

Performing dplyr mutate on subset of columns

6 Answers6

Linked

Related