
I will use the following data set to illustrate my questions:

my_df <- data.frame(
    a = 1:10,
    b = 10:1
)
colnames(my_df) <- c("a", "b")

Part 1

I use the mutate() function to create two new variables in my data set and I would like to compute the row means of the two new columns inside the same mutate() call. However, I would really like to be able to use the select() helpers such as starts_with(), ends_with() or contains().

My first try:

 my_df %>%
    mutate(
        a_2 = a^2,
        b_2 = b^2,
        mean = rowMeans(select(ends_with("2")))
    )
Error in mutate_impl(.data, dots) : 
  Evaluation error: No tidyselect variables were registered.

I understand why there is an error: the select() function is not given any .data argument. So in my second try I added "." as the first argument of select():

my_df %>%
    mutate(
        a_2 = a^2,
        b_2 = b^2,
        mean = rowMeans(select(., ends_with("2")))
    )
    a  b a_2 b_2 mean
1   1 10   1 100  NaN
2   2  9   4  81  NaN
3   3  8   9  64  NaN
4   4  7  16  49  NaN
5   5  6  25  36  NaN
6   6  5  36  25  NaN
7   7  4  49  16  NaN
8   8  3  64   9  NaN
9   9  2  81   4  NaN
10 10  1 100   1  NaN

The new problem after the second try is that the mean column does not contain the mean of a_2 and b_2 as expected, but only NaNs. After studying the code a bit, I understood why. The "." added to the select() call refers to the original my_df data frame, which does not have the a_2 and b_2 columns. So select() returns zero columns, and rowMeans() over zero columns yields NaN for every row.
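A minimal check of this explanation, assuming only that dplyr is installed. The names no_match and row_means are introduced here for illustration and are not part of the code above:

```r
library(dplyr)

my_df <- data.frame(a = 1:10, b = 10:1)

# Inside the pipe, "." is still the original my_df,
# which has no column whose name ends in "2".
no_match <- select(my_df, ends_with("2"))

# The selection is empty, so each row mean is taken over zero values.
ncol(no_match)
row_means <- rowMeans(no_match)
all(is.nan(row_means))
```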

I then tried to use dplyr functions such as current_vars() to see if it would make a difference:

 my_df %>%
    mutate(
        a_2 = a^2,
        b_2 = b^2,
        mean = rowMeans(select(current_vars(), ends_with("2")))
    )
Error in mutate_impl(.data, dots) : 
  Evaluation error: Variable context not set.

However, this is obviously NOT the way to use this function. The workaround is simply to add a second mutate() call:

 my_df %>%
    mutate(
        a_2 = a^2,
        b_2 = b^2
    ) %>%
    mutate(mean = rowMeans(select(., ends_with("2"))))
    a  b a_2 b_2 mean
1   1 10   1 100 50.5
2   2  9   4  81 42.5
3   3  8   9  64 36.5
4   4  7  16  49 32.5
5   5  6  25  36 30.5
6   6  5  36  25 30.5
7   7  4  49  16 32.5
8   8  3  64   9 36.5
9   9  2  81   4 42.5
10 10  1 100   1 50.5

Question 1: Is there any way to perform this task in the same mutate() call? Using a second mutate() function is not really an issue anyway; however, I am curious to know if there exists a way to refer to currently existing variables. The mutate() function allows for the usage of variables as soon as they are created inside the same mutate() call; however, this becomes problematic when functions are nested as shown in my example above.

Part 2

I also realize that using rowMeans() works in my solution; however, it is not really a dplyr-way of doing things especially because I need to use select() inside it. So, I decided to use the rowwise() and mean() functions instead. But once again, I would like to use one of the select() helpers for that and not have to list all variables in a c() function. I tried:

 my_df %>%
    mutate(
        a_2 = a^2,
        b_2 = b^2
    ) %>%
    rowwise() %>%
    mutate(
        mean = mean(ends_with("2"))
    )
Error in mutate_impl(.data, dots) : 
  Evaluation error: No tidyselect variables were registered.

I suspect that the error in the code is due to the fact that ends_with() is not inside select(), but I am showing this to ask whether there is a way to list the variables I want without having to specify them individually.
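For comparison, the manual form that this part of the question is trying to avoid would look roughly like the sketch below, with every column listed by hand inside c(). The name by_hand is introduced here for illustration only:

```r
library(dplyr)

my_df <- data.frame(a = 1:10, b = 10:1)

by_hand <- my_df %>%
  mutate(
    a_2 = a^2,
    b_2 = b^2
  ) %>%
  rowwise() %>%
  # Each column is named explicitly; no select() helper is involved.
  mutate(mean = mean(c(a_2, b_2))) %>%
  # Drop the row-wise grouping so later steps behave as usual.
  ungroup()
```

This works, but the c() call has to be edited whenever the set of columns changes, which is what a select() helper would avoid.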

Thank you for your time.

InfiniteFlash
SavedByJESUS
  • Your question in #2 baffles me. `my_df %>% mutate(a_2 = a^2, b_2 = b^2) %>% rowwise()%>% select(. , ends_with("2"))` is the object that you want to run `means()` upon, but this will never work because `rowMeans()` is designed to work horizontally while `means()` is not. – InfiniteFlash Jan 20 '18 at 10:17
  • @InfiniteFlashChess What do you mean "for #1, I'm referencing"? Also, with regards to question #2, what package does the `means()` function belong to? And yes, I specified in the question that I am trying to compute horizontal means. This is why I used `rowMeans()` in the first part and a combination of `rowwise()` and `mean()` in the second part. – SavedByJESUS Jan 20 '18 at 15:45
  • well, the point is that the function `mean()` won't operate that way you intend it to. I was "referencing #1" because it seemed worthy of a bounty. Likely, we'll need Hadley (or someone very proficient here) to answer it :) – InfiniteFlash Jan 24 '18 at 07:53
  • @InfiniteFlashChess I understand that. The input to the mean function is a numeric vector. It is actually possible to combine `rowwise()` and `mean()`; however, you need to manually specify column names in a `c()` function. I was just wondering if there existed a way to use one of the select helpers to perform the same task. – SavedByJESUS Jan 25 '18 at 15:37
  • SavedByJESUS, would definitely consider bountying Problem #1 and have someone attempt to answer it (I am interested in performing #1 properly as well!) – InfiniteFlash Jan 29 '18 at 02:55

2 Answers


A bit late, but here is a solution to problem 1, for reference.

If you had to do it without pipes, you would write:

tmp1 = mutate(my_df, a_2 = a^2, b_2 = b^2)
tmp2 = select(tmp1, ends_with("2"))
tmp3 = rowMeans(tmp2)
tmp4 = mutate(tmp1, m=tmp3)

Or, with fewer intermediate steps:

tmp1 = mutate(my_df, a_2 = a^2, b_2 = b^2)
tmp4 = mutate(tmp1, m=rowMeans(select(tmp1, ends_with("2"))) )

Note that computing tmp4 requires using tmp1 twice. So in the piped version you will also need to reference . explicitly a second time (as usual, the first reference is implicit, as the first argument to mutate()):

my_df %>%
  mutate(a_2 = a^2, b_2 = b^2) %>%
  mutate(mean = rowMeans(select(., ends_with("2"))) )

For problem #2: avoiding the call to rowMeans() is trickier, and maybe not desirable (?)

Pierre Gramme

Fortunately, since dplyr 1.0.0 there is a dplyr way to do exactly what you were looking for: c_across(). It is helpful because it extends the solution to other summary functions, including ones that may not have a row-wise implementation like rowMeans().

Try this:

my_df %>%
  mutate(
    a_2 = a^2,
    b_2 = b^2
  ) %>%
  rowwise() %>%
  mutate(mean = mean(c_across(ends_with("2"))))
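Newer dplyr releases also seem to cover Question 1. The sketch below assumes dplyr 1.1.0 or later, where pick() selects columns with select() helpers from inside mutate() and, as far as I can tell, sees columns created earlier in the same call. On 1.0.x, across(ends_with("2")) without a function appears to have played the same role. The name one_call is introduced here for illustration only:

```r
library(dplyr)

my_df <- data.frame(a = 1:10, b = 10:1)

one_call <- my_df %>%
  mutate(
    a_2 = a^2,
    b_2 = b^2,
    # pick() returns a data frame of the matching current columns,
    # including a_2 and b_2 defined just above (assumes dplyr >= 1.1.0).
    mean = rowMeans(pick(ends_with("2")))
  )
```

Because rowMeans() is vectorised over rows, this form needs no rowwise() step, which may matter on larger data frames.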
Nico Rojas