7

This is somehow related to this question: In principle I try to understand how rowwise operations with mutate across multiple columns applying more then 1 functions like (mean(), sum(), min() etc..) work.

I have learned that across does this job and not c_across. I have learned that the function mean() is different to the function min() in that way that mean() doesn't work on dataframes and we need to change it to vector which can be done with unlist or as.matrix -> learned from Ronak Shah hereUnderstanding rowwise() and c_across()

Now with my actual case: I was able to do this task but I loose one column d. How can I avoid the loose of the column d in this setting.

My df:

df <- structure(list(a = 1:5, b = 6:10, c = 11:15, d = c("a", "b", 
"c", "d", "e"), e = 1:5), row.names = c(NA, -5L), class = c("tbl_df", 
"tbl", "data.frame"))

Works not:

df %>% 
  rowwise() %>% 
  mutate(across(a:e), 
         avg = mean(unlist(cur_data()), na.rm = TRUE),
         min = min(unlist(cur_data()), na.rm = TRUE), 
         max = max(unlist(cur_data()), na.rm = TRUE)
  )

# Output:
      a     b     c d         e   avg min   max  
  <int> <int> <int> <chr> <int> <dbl> <chr> <chr>
1     1     6    11 a         1    NA 1     a    
2     2     7    12 b         2    NA 12    b    
3     3     8    13 c         3    NA 13    c    
4     4     9    14 d         4    NA 14    d    
5     5    10    15 e         5    NA 10    e 

Works, but I loose column d:

df %>% 
  select(-d) %>% 
  rowwise() %>% 
  mutate(across(a:e), 
         avg = mean(unlist(cur_data()), na.rm = TRUE),
         min = min(unlist(cur_data()), na.rm = TRUE), 
         max = max(unlist(cur_data()), na.rm = TRUE)
  )

      a     b     c     e   avg   min   max
  <int> <int> <int> <int> <dbl> <dbl> <dbl>
1     1     6    11     1  4.75     1    11
2     2     7    12     2  5.75     2    12
3     3     8    13     3  6.75     3    13
4     4     9    14     4  7.75     4    14
5     5    10    15     5  8.75     5    15
TarJae
  • 72,363
  • 6
  • 19
  • 66

4 Answers4

7

Using pmap() from purrr might be more preferable since you need to select the data just once and you can use the select helpers:

df %>% 
 mutate(pmap_dfr(across(where(is.numeric)),
                 ~ data.frame(max = max(c(...)),
                              min = min(c(...)),
                              avg = mean(c(...)))))

      a     b     c d         e   max   min   avg
  <int> <int> <int> <chr> <int> <int> <int> <dbl>
1     1     6    11 a         1    11     1  4.75
2     2     7    12 b         2    12     2  5.75
3     3     8    13 c         3    13     3  6.75
4     4     9    14 d         4    14     4  7.75
5     5    10    15 e         5    15     5  8.75

Or with the addition of tidyr:

df %>% 
 mutate(res = pmap(across(where(is.numeric)),
                   ~ list(max = max(c(...)),
                          min = min(c(...)),
                          avg = mean(c(...))))) %>%
 unnest_wider(res)
tmfmnk
  • 38,881
  • 4
  • 47
  • 67
  • Thank you tmfmnk. Could you please shortly comment on `c(...)`. – TarJae May 01 '21 at 16:35
  • 1
    see [this](https://stackoverflow.com/questions/67106591/why-ellipsis-have-to-be-wrapped-inside-c-when-used-in-lambda-functions-in) question. This may help – AnilGoyal May 01 '21 at 16:48
  • 1
    If usage of `purrr` is okay, nothing better than this. +1. – AnilGoyal May 01 '21 at 17:09
  • 1
    @TarJae These two might also help: https://stackoverflow.com/questions/67049561/using-pmap-with-c-part-2 https://stackoverflow.com/questions/67037099/using-pmap-function-to-check-whether-all-values-in-a-row-are-positive-or-negativ – Anoushiravan R May 01 '21 at 19:25
6

Edit:

Best way out here

df %>%
  rowwise() %>% 
  mutate(min = min(c_across(a:e & where(is.numeric)), na.rm = TRUE),
         max = max(c_across(a:e & where(is.numeric)), na.rm = TRUE), 
         avg = mean(c_across(a:e & where(is.numeric)), na.rm = TRUE)
  )

# A tibble: 5 x 8
# Rowwise: 
      a     b     c d         e   min   max   avg
  <int> <int> <int> <chr> <int> <int> <int> <dbl>
1     1     6    11 a         1     1    11  4.75
2     2     7    12 b         2     2    12  5.75
3     3     8    13 c         3     3    13  6.75
4     4     9    14 d         4     4    14  7.75
5     5    10    15 e         5     5    15  8.75

Earlier Answer Your this will work won't even work properly, if you change the output sequence, see

df %>% 
  select(-d) %>% 
  rowwise() %>% 
  mutate(across(a:e), 
         min = min(unlist(cur_data()), na.rm = TRUE),
         max = max(unlist(cur_data()), na.rm = TRUE), 
         avg = mean(unlist(cur_data()), na.rm = TRUE)
  )

# A tibble: 5 x 7
# Rowwise: 
      a     b     c     e   min   max   avg
  <int> <int> <int> <int> <int> <int> <dbl>
1     1     6    11     1     1    11  5.17
2     2     7    12     2     2    12  6.17
3     3     8    13     3     3    13  7.17
4     4     9    14     4     4    14  8.17
5     5    10    15     5     5    15  9.17

Therefore, it is advised to do it like this-

df %>% 
  select(-d) %>% 
  rowwise() %>% 
  mutate(min = min(c_across(a:e), na.rm = TRUE),
         max = max(c_across(a:e), na.rm = TRUE), 
         avg = mean(c_across(a:e), na.rm = TRUE)
  )

# A tibble: 5 x 7
# Rowwise: 
      a     b     c     e   min   max   avg
  <int> <int> <int> <int> <int> <int> <dbl>
1     1     6    11     1     1    11  4.75
2     2     7    12     2     2    12  5.75
3     3     8    13     3     3    13  6.75
4     4     9    14     4     4    14  7.75
5     5    10    15     5     5    15  8.75

One more alternative is

cols <- c('a', 'b', 'c', 'e')
df %>%
  rowwise() %>% 
  mutate(min = min(c_across(cols), na.rm = TRUE),
         max = max(c_across(cols), na.rm = TRUE), 
         avg = mean(c_across(cols), na.rm = TRUE)
  )

# A tibble: 5 x 8
# Rowwise: 
      a     b     c d         e   min   max   avg
  <int> <int> <int> <chr> <int> <int> <int> <dbl>
1     1     6    11 a         1     1    11  4.75
2     2     7    12 b         2     2    12  5.75
3     3     8    13 c         3     3    13  6.75
4     4     9    14 d         4     4    14  7.75
5     5    10    15 e         5     5    15  8.75

Even @Sinh suggested approach of group_by won't work properly in these cases.

AnilGoyal
  • 25,297
  • 4
  • 27
  • 45
  • Thank you very much AnilGoyal. I really appreciate your efforts and I understand all your thoughts and agree with you. But somehow I am not satisfied to repeat `c_across(a:e & where(is.numeric)), na.rm = TRUE)` three times. I imagine that it should be possible to prevent this repetition and use it once with `across`. – TarJae May 01 '21 at 16:22
  • TarJae, I fear what you are trying to do with `dplyr` only may not be possible, `mutate(across..` works in slightly different way. It mutates all the existing columns in it. Even if you use `cur_data` it will include newly added columns as shown above. Moreover, you still have to repeat `cur_data..bla..bla` again n number of times unlike `mutate(across`. This may however, be possible with many methods in `purrr` - one as shown in an answer. – AnilGoyal May 01 '21 at 16:55
2

Here is one method which would preserve the data.frame attribute in mutate if we want to set a particular column to row name attribute (column_to_rownames) and then return the attribute after the transformation

library(dplyr)
library(tibble)
library(purrr)
df %>% 
   column_to_rownames('d') %>%
   mutate(max = reduce(., pmax), min = reduce(., pmin), 
         avg = rowMeans(.)) %>% 
   rownames_to_column('d')
#  d a  b  c e max min  avg
#1 a 1  6 11 1  11   1 4.75
#2 b 2  7 12 2  12   2 5.75
#3 c 3  8 13 3  13   3 6.75
#4 d 4  9 14 4  14   4 7.75
#5 e 5 10 15 5  15   5 8.75
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Dear Arun, is it possible to apply all three function on each row using `invoke_map` or `exec`? – Anoushiravan R May 01 '21 at 19:29
  • Yes, thank you. I also included `mean` into the function list but since it is a column-wise operation it wouldn't get to the desired result. However `pmin` and `pmax` make it possible to get the max and min per rows because of their functionality. I was just curious. – Anoushiravan R May 01 '21 at 19:50
  • 1
    @AnoushiravanR the functions `pmin/pmax` and `mean` have different behavior. pmin/pmax acts on the rows in a parallel, where as `mean` expects a vector and it is not vectorized. You may expect `rowMeans` to work, but the arguments to that function is just a single one i.e. `x` which can be matrix or data.frame/tibble. where as in `pmax/pmin`, it is `...` ie. it can vary. So `invoke_map(list(pmax, pmin), list(df %>% select(-d)))` works as the arguments are are the data.frame columns – akrun May 01 '21 at 20:16
  • 1
    Thank you very much indeed. That argument matching is an important factor which I need to take into account in a moment like this. But glad to have finally familiarized myself with parallel maxima and minima. – Anoushiravan R May 01 '21 at 20:59
1

I think creating a row-wise tibble column creates a readable, elegant solution. The mean function takes a bit of extra work since it does not accept values with the ellipsis (...):

library(dplyr)

df |>
  rowwise() |>
  mutate(x = pick(where(is.numeric)),
         avg = mean(unlist(x)),
         min = min(x),
         max = max(x)) |>
  select(-x) |>
  ungroup()

Output

      a     b     c d         e   avg   min   max
  <int> <int> <int> <chr> <int> <dbl> <int> <int>
1     1     6    11 a         1  4.75     1    11
2     2     7    12 b         2  5.75     2    12
3     3     8    13 c         3  6.75     3    13
4     4     9    14 d         4  7.75     4    14
5     5    10    15 e         5  8.75     5    15
LMc
  • 12,577
  • 3
  • 31
  • 43