8

I would like to obtain a generic formula to arrange dataframes with a varying number of columns.

For example, in this case the dataframe contains "categ_1, categ_2, points_1, points_2":

  library(tidyverse)
  set.seed(1)
  nrows <- 20
  df <- tibble(
    other_text = sample(letters,
                        nrows, replace = TRUE),
    categ_1 = sample(c("A", "B"), nrows, replace = TRUE),
    categ_2 = sample(c("A", "B"), nrows, replace = TRUE),
    points_1 = sample(20:25, nrows, replace = TRUE),
    points_2 = sample(20:25, nrows, replace = TRUE),
  ) %>%
    rowwise() %>%
    mutate(total = sum(c_across(starts_with("points_")))) %>%
    ungroup()

And the formula to arrange:

df %>%
  arrange(
    desc(total),
    categ_1, categ_2,
    desc(points_1), desc(points_2)
  )

But df could have more columns: "categ_1, categ_2, categ_3, points_1, points_2, points_3". So, in that case, the formula should be:

df %>%
  mutate(
    categ_3 = sample(c("A", "B"), nrows, replace = TRUE),
    points_3 = sample(20:25, nrows, replace = TRUE),
  ) %>%
    rowwise() %>%
    mutate(total = sum(c_across(starts_with("points_")))) %>%
    ungroup() %>%
    arrange(
      desc(total),
      categ_1, categ_2, categ_3,
      desc(points_1), desc(points_2), desc(points_3)
    )

I tried writing a generic formula (using across):

  library(daff)

  daff::diff_data(
    df %>%
      arrange(
        desc(total),
        categ_1, categ_2,
        desc(points_1), desc(points_2)
      )
    ,
    df %>%
      arrange(
        desc(total),
        across(starts_with("categ_")),
        across(starts_with("points_"), desc)
      )
  )
#> Daff Comparison: ‘df %>% arrange(desc(total), categ_1, categ_2, desc(points_1), ’ ‘    desc(points_2))’ vs. ‘df %>% arrange(desc(total), across(starts_with("categ_")), across(starts_with("points_"), ’ ‘    desc))’
#>           A:A        B:B     ... E:E      F:F
#>       @@  other_text categ_1 ... points_2 total
#>       ... ...        ...     ... ...      ...
#> 10:9      z          A       ... 23       45
#> 9:10  :   v          A       ... 22       45
#> 11:11     s          B       ... 23       45
#>       ... ...        ...     ... ...      ...

It seems like a bug in arrange: arrange only considers the parameters until the first across.

I also tried writing the conditions inside a case_when but couldn't find the correct syntax:

  # not working
  df %>%
    arrange(
      across(everything(), ~ case_when(
        . == "total" ~ .,
        str_detect(., "categ_") ~ .,
        str_detect(., "points_") ~ desc(.),
        TRUE ~ 1
      )
      )
    )
#> Error in `arrange()`:
#> ! Problem with the implicit `transmute()` step.

What would be the generic way of writing that formula inside arrange? (Other alternatives are welcome but I would prefer a tidyverse solution.)

bretauv
  • 7,756
  • 2
  • 20
  • 57
crestor
  • 1,388
  • 8
  • 21
  • Can you check you're using `dplyr::arrange()` and not `plyr::arrange()` by accident? The order they were loaded matters if you don't specify the package. – Paul Stafford Allen Oct 04 '22 at 07:16
  • Have you tried wrapping the second `across` inside a `desc()` call instead of using it as a further variable? i.e. `df %>% arrange( desc(total), across(starts_with("categ_")), desc(across(starts_with("points_"))) )` – Paul Stafford Allen Oct 04 '22 at 07:33
  • 1
    @Paul Yes, I have also tried `desc(across(starts_with("points_"))` but it's the same. `arrange`only takes in consideration until the first `across` function, not any other parameters after that. – crestor Oct 04 '22 at 07:43
  • 1
    @Paul I load the package *tidyverse*. So the package *plyr* is not even loaded. (It's not included inside the *tidyverse* collection of packages.) – crestor Oct 04 '22 at 07:46
  • Same here, your across calls inside `arrange()` work fine for me. I'm on dplyr v. 1.0.7 – TimTeaFan Oct 04 '22 at 08:24
  • @thothal and @TimTeaFan: The formula doesn't return an error. But it doesn't sort correctly. You can use `daff::dif_data` to compare both dataframes and see the differences. – crestor Oct 04 '22 at 08:43
  • 2
    For those finding that it runs fine in other versions of `dplyr`, try changing the seed and starting again. I think the seed above in some versions produces random values which just happen to be in the right order from the start. Would also suggest in example changing to `replace = FALSE` in `other_text` to make this a unique id row (to help see differences in order?) – Andy Baxter Oct 04 '22 at 09:12
  • 1
    Ok, good catchm by changing the seed I see now also differences (`dplyr_1.0.9`) – thothal Oct 04 '22 at 11:33

3 Answers3

5

You could try just wrapping everything in the arrange() in a data frame. Looks like arrange() does some code manipulation to handle top-level desc() calls in a special manner, which has a bad interaction with the data frames created by across(). But that can be avoided using the data frame unpacking feature.

library(tidyverse)

set.seed(3)
nrows <- 20

df <- tibble(
  other_text = sample(letters, nrows, replace = TRUE),
  categ_1 = sample(c("A", "B"), nrows, replace = TRUE),
  categ_2 = sample(c("A", "B"), nrows, replace = TRUE),
  points_1 = sample(20:25, nrows, replace = TRUE),
  points_2 = sample(20:25, nrows, replace = TRUE),
) %>%
  rowwise() %>%
  mutate(total = sum(c_across(starts_with("points_")))) %>%
  ungroup()

identical(
  df %>%
    arrange(
      desc(total),
      categ_1, categ_2,
      desc(points_1), desc(points_2)
    ),
  df %>%
    arrange(
      tibble(
        desc(total),
        across(starts_with("categ_")),
        across(starts_with("points_"), desc)
      )
    )
)
#> [1] TRUE
Mikko Marttila
  • 10,972
  • 18
  • 31
  • Surprising alternative! Could be taken into account for test cases (when suspecting that `arrange` could have a bug or regression). – crestor Oct 05 '22 at 06:28
4

An easy (sort of) solution would be to use cur_column() to determine the column name which eventually determines the sorting order:

Data

library(dplyr)
library(daff)

set.seed(32211)
nrows <- 20

df <- tibble(
   other_text = sample(letters,
                       nrows, replace = TRUE),
   categ_1 = sample(c("A", "B"), nrows, replace = TRUE),
   categ_2 = sample(c("A", "B"), nrows, replace = TRUE),
   points_1 = sample(20:25, nrows, replace = TRUE),
   points_2 = sample(20:25, nrows, replace = TRUE),
) %>%
   rowwise() %>%
   mutate(total = sum(c_across(starts_with("points_")))) %>%
   ungroup()

Comparison

by_hand <- df %>%
   arrange(
      desc(total),
      categ_1, categ_2,
      desc(points_1), desc(points_2)
   )

with_across <- df %>%
   arrange(
      across(
         c(total, starts_with("categ_"), starts_with("points_")),
         .fns = function(x) {
            if (grepl("^total$|^points_.*$", cur_column()))
               desc(x)
            else
               x
         })
   )

diff_data(by_hand, with_across)
# Daff Comparison: ‘by_hand’ vs. ‘with_across’ 
#      other_text categ_1 categ_2 points_1 points_2 total
thothal
  • 16,690
  • 3
  • 36
  • 71
2

Latest, super-simple fix

Install development version:

# remotes::install_github("tidyverse/dplyr")
library(tidyverse)

set.seed(144)
nrows <- 20
df <- tibble(
  other_text = sample(letters,
                      nrows, replace = FALSE),
  categ_1 = sample(c("A", "B"), nrows, replace = TRUE),
  categ_2 = sample(c("A", "B"), nrows, replace = TRUE),
  points_1 = sample(1:25, nrows, replace = FALSE),
  points_2 = sample(100:125, nrows, replace = FALSE),
) %>%
  rowwise() %>%
  mutate(total = sum(c_across(starts_with("points_")))) %>%
  ungroup()

out1 <- df %>%
  arrange(
    desc(total),
    categ_1, categ_2,
    desc(points_1), desc(points_2)
  )

out2 <- df %>%
  arrange(
    desc(total),
    across(starts_with("categ_")),
    across(starts_with("points_"), desc)
  )

daff::diff_data(out1, out2)
#> Daff Comparison: 'out1' vs. 'out2' 
#>      other_text categ_1 ...
Andy Baxter
  • 5,833
  • 1
  • 8
  • 22
  • 1
    Thanks for opening the github issue. That was also my intention. I was waiting to obtain some confirmatory comments here about considering it a bug. – crestor Oct 04 '22 at 12:02
  • 1
    apologies for stealing your thunder! Happy to take down if you would rather report with your own example data. Also playing around a bit more I think my above solution doesn't quite fully work. I think the other solution seems clearer, tidier and more robust! – Andy Baxter Oct 04 '22 at 12:07
  • Turns out someone had noticed it before: https://github.com/tidyverse/dplyr/issues/6490. Hope that fixes things. – Andy Baxter Oct 04 '22 at 15:08
  • 1
    Thanks for the mediation on github. – crestor Oct 05 '22 at 06:43