I'm having trouble getting R's dplyr::arrange() to sort properly when used in a for loop. I found many posts discussing this issue (like ex.1 with .by_group=TRUE and using desc() better, ex.2 with lists, and ex.3 with filter_all() and %in%). Yet I still don't understand why arrange() works when I use the column name directly but not when I refer to it by its index position within a vector of column names, which will later be used in a loop to aid data extraction from a larger dataframe.

Here is a reproducible toy dataset to demonstrate:

library(dplyr)
set.seed(1)
toy <- data.frame(a=rep(sample(letters[1:5], 4, TRUE)), tf=sample(c("T","F"), 100, TRUE), n1=sample(1:100, 100, TRUE), n2=1:100)
get_it <- colnames(toy)[3:4]

My initial approach works with the indexed vector in the select() step, but fails to sort in the arrange() step, even with the .by_group option. I also tried calling it as dplyr::arrange(), but that made no difference.

j=1  # pretending this is the 1st pass in the loop
toy %>% 
  select(a, tf, get_it[j]) %>% 
  group_by(a) %>% 
  arrange(desc(get_it[j]), .by_group=TRUE)

   a     tf     n1
<chr>  <chr>  <int>
   a      T     21
   a      T     17
   a      F     87
   a      T     90
   a      T     64  

example output truncated

However, I get the intended sorted results when I replace the indexed vector in arrange() with the column name itself (select() still works fine):

j=1  # pretending this is the 1st pass through the loop
toy %>% 
  select(a, tf, get_it[j]) %>% 
  group_by(a) %>% 
  arrange(desc(n1), .by_group=TRUE)

   a     tf     n1
<chr>  <chr>  <int>
   a      F     99
   a      F     98
   a      F     96
   a      F     95
   a      T     93  

example output truncated

Why does the second version work, but not the first? What should I change so that I can loop this through many columns?
Thanks in advance! I appreciate your time!

(minor edit to correct a typo.)

Shawn Janzen
    You need to look at [programming with dplyr](https://dplyr.tidyverse.org/articles/programming.html). `desc(get_it[j])` is going to sort descending on a static string, not on the values of the column suggested by that string. By sorting on a static string, the order (assuming a natural/stable sort) will be unchanged. – r2evans May 06 '22 at 18:26
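The point in the comment above can be checked on a small example. This is a minimal sketch, not taken from the original post: the data frame `df` and the variable `col` are hypothetical names chosen for illustration. It shows that `desc()` applied to a string gives one constant sort key, so the row order is left unchanged, while `.data[[col]]` looks up the column values.

```r
library(dplyr)

# Hypothetical toy data, small enough to check by eye
df <- data.frame(g = c("x", "x", "y", "y"), n1 = c(3, 10, 7, 1))
col <- "n1"  # hypothetical variable holding a column name as a string

# desc() sees only the string "n1", not the column: this is a single
# constant value, which arrange() recycles to every row
const_key <- desc(col)

# Sorting on a constant key leaves every row tied, so the order is unchanged
by_string <- arrange(df, desc(col))

# .data[[col]] retrieves the column named by the string, so the sort works
by_column <- arrange(df, desc(.data[[col]]))
```

Comparing `by_string$n1` with `df$n1` should show the same order, while `by_column$n1` is in descending order.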

1 Answer


This is "programming with dplyr": use the .data pronoun to reference a column by a string:

toy %>% 
  select(a, tf, get_it[j]) %>% 
  group_by(a) %>% 
  arrange(desc(.data[[ get_it[j] ]]), .by_group=TRUE)
# # A tibble: 100 x 3
# # Groups:   a [3]
#    a     tf       n1
#    <chr> <chr> <int>
#  1 a     F        99
#  2 a     F        98
#  3 a     F        96
#  4 a     F        95
#  5 a     T        93
#  6 a     T        92
#  7 a     T        92
#  8 a     T        90
#  9 a     F        87
# 10 a     F        86
# # ... with 90 more rows
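Since the question asks about looping over many columns, here is one possible way to wrap the pipeline above in a loop. It is a sketch under some assumptions: the `results` list and the loop variable `col` are names introduced here for illustration, and `all_of()` is used in `select()` as the tidyselect helper for character vectors of column names.

```r
library(dplyr)

# Toy data from the question
set.seed(1)
toy <- data.frame(a = rep(sample(letters[1:5], 4, TRUE)),
                  tf = sample(c("T", "F"), 100, TRUE),
                  n1 = sample(1:100, 100, TRUE),
                  n2 = 1:100)
get_it <- colnames(toy)[3:4]

# Hypothetical container: one sorted data frame per column name
results <- list()
for (col in get_it) {
  results[[col]] <- toy %>%
    select(a, tf, all_of(col)) %>%   # all_of() selects by a character vector
    group_by(a) %>%
    arrange(desc(.data[[col]]), .by_group = TRUE) %>%  # sort on the column's values
    ungroup()
}
```

Each element of `results` is then named after the column it was sorted on, for example `results[["n1"]]`.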
r2evans
  • Thanks for the solution portion too! I was reading the link you shared when you posted this. It's counter-intuitive for me as to why it would be necessary to have this sort of difference, but it is what it is. Thanks again! – Shawn Janzen May 06 '22 at 18:55
    Perhaps read https://stackoverflow.com/a/63399340/3358272, a discussion of the dplyr "pronouns" `.data` and `.env`. There are many places where non-standard evaluation (NSE) such as used in dplyr can become ambiguous, at least to the package/functions, and what may be obvious/intuitive to the reader may not be as clear to the function. – r2evans May 06 '22 at 19:29
    To muddy the waters, there is also `cur_data()`, which is different than `.data` in some situations. See https://stackoverflow.com/q/70465515/3358272. – r2evans May 06 '22 at 19:30
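As a small illustration of the ambiguity mentioned in these comments, the sketch below uses the `.data` and `.env` pronouns to say explicitly whether a name means a column or a variable in the calling environment. The data frame `df` and the clashing variable `n1` are hypothetical, made up for this example.

```r
library(dplyr)

df <- data.frame(n1 = c(5, 2, 9))
n1 <- 4  # hypothetical environment variable that shares a column's name

# A bare n1 refers to the column on both sides, so no row can pass
ambiguous <- filter(df, n1 > n1)

# The pronouns state the intent: the column n1 versus the environment's n1
explicit <- filter(df, .data$n1 > .env$n1)
```

Here `ambiguous` has no rows, while `explicit` keeps the rows whose column value exceeds 4.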