13

Recently I stumbled upon some strange behaviour in dplyr, and I would be happy if somebody could provide some insight.

Assume I have a data frame in which some columns contain numerical values. As a simple scenario, I would like to compute row sums. Although there are many ways to do it, here are two examples:

library(dplyr)  # attaches the %>% pipe used below

df <- data.frame(matrix(rnorm(20), 10, 2),
                 ids = paste("i", 1:20, sep = ""),
                 stringsAsFactors = FALSE)

# works
dplyr::select(df, - ids) %>% {rowSums(.)}

# does not work
# Error: invalid argument to unary operator
df %>%
  dplyr::mutate(blubb = dplyr::select(df, - ids) %>% {rowSums(.)})

# does not work
# Error: invalid argument to unary operator
df %>%
  dplyr::mutate(blubb = dplyr::select(., - ids) %>% {rowSums(.)})

# workaround:
tmp <- dplyr::select(df, - ids) %>% {rowSums(.)}
df %>%
  dplyr::mutate(blubb = tmp)

# works
rowSums(dplyr::select(df, - ids))

# does not work
# Error: invalid argument to unary operator
df %>%
  dplyr::mutate(blubb = rowSums(dplyr::select(df, - ids)))

# workaround
tmp <- rowSums(dplyr::select(df, - ids))
df %>%
  dplyr::mutate(blubb = tmp)

First, I don't really understand what is causing the error, and second, I would like to know how to actually achieve a tidy computation of some (viable) columns in a tidy way.

edit

The question "mutate and rowSums exclude columns", although related, focuses on using rowSums for computation. Here I'm eager to understand why the examples above do not work. It is not so much about how to solve the problem (see the workarounds) but about understanding what happens when the naive approach is applied.

phalteman
Drey
  • Why not `dplyr::select(df, - ids) %>% mutate(foo=rowSums(.))` – Haboryme Jan 27 '17 at 13:50
  • Try with `ids = paste("-i", 1:20, sep = "")` I guess I had read this somewhere – joel.wilson Jan 27 '17 at 13:55
  • Possible duplicate of [mutate and rowSums exclude columns](http://stackoverflow.com/questions/33314971/mutate-and-rowsums-exclude-columns) – Weihuang Wong Jan 27 '17 at 14:35
  • @WeihuangWong the question there, although related, focuses on using `rowSums` for computation. Here I'm eager to understand why the examples above do not work. It is not so much about how to solve the problem (see the workarounds) but about understanding what happens when the naive approach is applied – Drey Jan 27 '17 at 14:38
  • @Haboryme I would like to keep ids for later use. Hence I would like to make the selection inside the `mutate` call. – Drey Jan 27 '17 at 14:39
  • @joel.wilson Thank you, unfortunately, this does not resolve any issues in the examples above – Drey Jan 27 '17 at 14:41

6 Answers

33

The examples do not work because you are nesting select in mutate and using bare variable names. In this case, select is trying to do something like

> -df$ids
Error in -df$ids : invalid argument to unary operator

which fails because you can't negate a character string (i.e. -"i1" or -"i2" makes no sense). Either of the formulations below works:

df %>% mutate(blubb = rowSums(select_(., "X1", "X2")))
df %>% mutate(blubb = rowSums(select(., -3)))

or

df %>% mutate(blubb = rowSums(select_(., "-ids")))

as suggested by @Haboryme.
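
The failure can be reproduced without dplyr at all. The following is a minimal base R sketch, not a reproduction of dplyr's internals: it negates a character vector directly, which is roughly what `-ids` amounts to when `ids` is looked up as column data, and captures the error message with `tryCatch`. The vectors `ids` and `cols` are stand-ins invented for illustration.

```r
# Stand-in for the `ids` column of df (an assumption for illustration).
ids <- paste0("i", 1:3)

# Negating a character vector fails; capture the message instead of stopping.
msg <- tryCatch(
  -ids,
  error = function(e) conditionMessage(e)
)
print(msg)

# Negating a column *position* is an ordinary numeric operation, which is
# consistent with select(., -3) working in the answer above.
cols <- c("X1", "X2", "ids")
kept <- cols[-3]
print(kept)
```

The point of the sketch is only that the error comes from the unary minus being applied to character data, not from `rowSums` or the pipe.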

Weihuang Wong
  • Or `df %>% mutate(blubb = rowSums(select_(., "-ids")))` which might be a bit more convenient to use. – Haboryme Jan 27 '17 at 15:44
  • Is it possible to pattern-match the columns to be selected? Instead of "-ids", something like starts_with("X")? I tried `df %>% mutate(blubb = rowSums(select_(., starts_with("X"))))` and got `Error in mutate_impl(.data, dots) : Evaluation error: Variable context not set.` – rpm Jan 10 '18 at 19:12
  • Using select instead of select_ does it. – rpm Jan 10 '18 at 19:18
5

select_ is deprecated. You can use:

library(dplyr)
df <- data.frame(matrix(rnorm(20), 10, 2),
                 ids = paste("i", 1:20, sep = ""),
                 stringsAsFactors = FALSE)
df %>% 
  mutate(blubb = rowSums(select(., .dots = c("X1", "X2"))))

# Or more generally:
desired_columns <- c("X1", "X2")
df %>% 
  mutate(blubb = rowSums(select(., .dots = all_of(desired_columns))))
HBat
2

select can now accept bare column names, so there is no need to use .dots or select_, which has been deprecated.

Here are a few approaches that work now.

library(dplyr)

#sum all the columns except `ids`
df %>% mutate(blubb = rowSums(select(., -ids), na.rm = TRUE))

#sum X1 and X2 columns
df %>% mutate(blubb = rowSums(select(., X1, X2), na.rm = TRUE))

#sum all the columns that start with 'X'
df %>% mutate(blubb = rowSums(select(., starts_with('X')), na.rm = TRUE))

#sum all the numeric columns
df %>% mutate(blubb = rowSums(select(., where(is.numeric))))
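
As a quick sanity check, the sketch below compares the in-pipe selection with the precomputed workaround from the question. It assumes a recent dplyr release in which `select(., -ids)` works inside `mutate()`; the seed and the use of 10 ids (so that no recycling occurs) are choices made here for illustration, not part of the original example.

```r
library(dplyr)

set.seed(1)  # arbitrary seed so the run is reproducible
df <- data.frame(matrix(rnorm(20), 10, 2),
                 ids = paste0("i", 1:10),
                 stringsAsFactors = FALSE)

# Workaround from the question: compute the sums outside the pipe.
tmp <- rowSums(df[, c("X1", "X2")])

# Selection inside mutate(); `.` is the data frame piped into mutate().
res <- df %>% mutate(blubb = rowSums(select(., -ids)))

# Compare values only; names on the vectors are irrelevant here.
same <- isTRUE(all.equal(unname(res$blubb), unname(tmp)))
print(same)
```

If the two disagree or the `mutate()` call errors, the installed dplyr is probably old enough that the explanation in the accepted answer still applies.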
Ronak Shah
1

Adding to this old thread because I searched on this question then realized I was asking the wrong question. Also, I detect some yearning in this and related questions for the proper pipe steps way to do this.

The answers here are somewhat non-intuitive because they are trying to use the dplyr vernacular with non-"tidy" data. If you want to do it the dplyr way, make the data tidy first, using gather(), and then use summarise():

library(tidyverse)

df <- data.frame(matrix(rnorm(20), 10, 2),
                 ids = paste("i", 1:20, sep = ""),
                 stringsAsFactors = FALSE)

df %>% gather(key=Xn,value="value",-ids) %>% 
  group_by(ids) %>% 
  summarise(rowsum=sum(value))

#> # A tibble: 20 x 2
#>    ids   rowsum
#>    <chr>       <dbl>
#>  1 i1          0.942
#>  2 i10        -0.330
#>  3 i11         0.942
#>  4 i12        -0.721
#>  5 i13         2.50 
#>  6 i14        -0.611
#>  7 i15        -0.799
#>  8 i16         1.84 
#>  9 i17        -0.629
#> 10 i18        -1.39 
#> 11 i19         1.44 
#> 12 i2         -0.721
#> 13 i20        -0.330
#> 14 i3          2.50 
#> 15 i4         -0.611
#> 16 i5         -0.799
#> 17 i6          1.84 
#> 18 i7         -0.629
#> 19 i8         -1.39 
#> 20 i9          1.44

If you care about the order of the ids when they are not sortable using arrange(), make that column a factor first.

df %>%
  mutate(ids=as_factor(ids)) %>% 
  gather(key=Xn,value="value",-ids) %>% 
  group_by(ids) %>% 
  summarise(rowsum=sum(value))
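
For newer code, the tidyr documentation recommends `pivot_longer()` over `gather()`. The following is a sketch of the same reshape-then-summarise idea, assuming tidyr 1.0.0 or later; the seed and the 10 unique ids are choices made here for illustration.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # arbitrary seed so the run is reproducible
df <- data.frame(matrix(rnorm(20), 10, 2),
                 ids = paste0("i", 1:10),
                 stringsAsFactors = FALSE)

# Reshape to one row per (ids, column) pair, then sum within each id.
long_sums <- df %>%
  pivot_longer(cols = -ids, names_to = "Xn", values_to = "value") %>%
  group_by(ids) %>%
  summarise(rowsum = sum(value))

print(long_sums)
```

As with `gather()`, the result is sorted by `ids` as character strings, so the row order differs from the original data frame unless `ids` is converted to a factor first.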
Art
  • Yes, thank you for pointing this out. In hindsight it seems that the proposed data was not tidy in the first place! – Drey Aug 28 '18 at 19:37
0

Why do you want to use the pipe operator? Just write an expression such as:

rowSums(df[,sapply(df, is.numeric)])

i.e. calculate the row sums over all the numeric columns, with the advantage of not needing to specify ids.
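
To keep the result alongside the data, the same idea can be written as a column assignment. This is a base R sketch under a few assumptions: the column name `blubb` is borrowed from the question, the seed and the 10 ids are chosen here for illustration, and `drop = FALSE` is added so that the subset stays two-dimensional even when only one column is numeric.

```r
set.seed(1)  # arbitrary seed so the run is reproducible
df <- data.frame(matrix(rnorm(20), 10, 2),
                 ids = paste0("i", 1:10),
                 stringsAsFactors = FALSE)

# Logical index of the numeric columns; vapply() guarantees a logical result.
is_num <- vapply(df, is.numeric, logical(1))

# drop = FALSE keeps a data frame even if only one column is numeric;
# rowSums() needs a two-dimensional input.
df$blubb <- rowSums(df[, is_num, drop = FALSE])

print(head(df))
```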

nadizan
  • Thank you, however, this does not answer the first question. I have some solutions posted above, and yours surely applies, but it does not address the question of what is actually wrong with the other statements. – Drey Jan 27 '17 at 14:39
  • @Drey, it actually answers your second question, "I would like to know how to actually achieve a tidy computation of some (viable) columns in a tidy way". – nadizan Jan 27 '17 at 14:45
  • But the major concern of mine is why the above ones do not work, although the workarounds do. – Drey Jan 27 '17 at 15:11
0

If you want to save your results as a column within data, you can use data.table syntax like this:

library(data.table)
dt <- as.data.table(df)
# .SDcols restricts .SD to the numeric columns; := adds x3 by reference
dt[, x3 := rowSums(.SD, na.rm = TRUE), .SDcols = which(sapply(dt, is.numeric))]
juliamm2011