1

I have a tibble with 5 columns. The first three are binary, the fourth is integer (assume, only 1 or 2), and I want it ordered with particular order of values:

1234
----
0001
0002
1001
1002
0101
0102
0011
0012
1101
1102
1011
1012
0111
0112
1111
1112

Finally, the fifth variable is just an outcome and I do not care about the order.

Can dplyr::arrange() by instructed to reproduce this sorting?

user438383
  • 5,716
  • 8
  • 28
  • 43
GiulioGCantone
  • 195
  • 1
  • 10
  • 6
    What's the logic here? Is the order completely arbitrary? These are separate columns in your actual data? You would probably want to create a function to transform your columns into a single numeric value that sorts in the "correct" way It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Use a `dput()` so it's clear how your data is structured. – MrFlick Feb 21 '23 at 16:45
  • If those are all 16 possibilities, you can sort other data that way by joining it to this data. – Gregor Thomas Feb 21 '23 at 16:46
  • The order is arbitrary. I can try to put it this way: - 0 comes always before 1, within the same column - 4rth column comes always before 1 of the first three, but not before 0. - 0 of the 3rd comes before 0 of the 2nd, which comes before 0 of the first. A better way to think about it: the first three are absence/presence of inputs, the fourth is time. If you think about it, the scheme is clear. – GiulioGCantone Feb 21 '23 at 16:54
  • 2
    @GiulioGCantone for you it is clear because you have the data, you can play with it and you know what it means. But it's not that easy to understand from the outside. Instead of only explaining with words, I think it would be clearer for everyone if you showed a minimal and reproducible example with the expected output (take a look at [reprex](https://reprex.tidyverse.org/)) – bretauv Feb 21 '23 at 17:31
  • 1
    *"If you think about it, the scheme is clear"* directly contradicts *"The order is arbitrary"*. – Gregor Thomas Feb 21 '23 at 19:56

2 Answers2

2

If your ordering is arbitrary and corresponds to the shown input in the OP, then we can create a variable on the fly inside dplyr::arrange() make it a factor and assign the defined sorting as levels.

What we basically do is:

  1. we select the first four columns with across and turn them into character
  2. unite them into one column without space sep =""
  3. and then access this column with pull to
  4. make it a factor and assign the levels from our custom sorting.

Note that the output doesn't contain all the possible orderings, because the random data is, well, random. But the data that appears is ordered according to our levels defined in sorting.

library(tidyverse)

set.seed(123)

df <- tibble(col1 = sample(c(0, 1), replace=TRUE, size=20), 
             col2 = sample(c(0, 1), replace=TRUE, size=20),
             col3 = sample(c(0, 1), replace=TRUE, size=20),
             col4 = sample(c(1,2), replace=TRUE, size=20),
             col5 = sample(c("A", "B", "C"), replace=TRUE, size=20))

sorting <- readr::read_lines("0001
0002
1001
1002
0101
0102
0011
0012
1101
1102
1011
1012
0111
0112
1111
1112
")

df %>% 
  arrange(across(col1:col4, as.character) %>%
            unite(., col = "ord", sep = "") %>%
            pull(ord) %>% 
            factor(., levels = sorting))

#> # A tibble: 20 x 5
#>     col1  col2  col3  col4 col5 
#>    <dbl> <dbl> <dbl> <dbl> <chr>
#>  1     0     0     0     1 B    
#>  2     0     0     0     1 B    
#>  3     0     0     0     1 B    
#>  4     0     0     0     1 A    
#>  5     1     0     0     1 B    
#>  6     1     0     0     1 A    
#>  7     1     0     0     2 C    
#>  8     0     1     0     1 B    
#>  9     0     1     0     1 A    
#> 10     0     1     0     2 A    
#> 11     0     0     1     2 C    
#> 12     0     0     1     2 C    
#> 13     0     0     1     2 B    
#> 14     1     1     0     1 C    
#> 15     1     1     0     1 B    
#> 16     1     1     0     1 B    
#> 17     1     0     1     2 A    
#> 18     1     0     1     2 C    
#> 19     0     1     1     2 B    
#> 20     1     1     1     1 A

Created on 2023-02-21 by the reprex package (v2.0.1)

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
TimTeaFan
  • 17,549
  • 4
  • 18
  • 39
1

It's an interesting problem. The starting point maybe to convert your order into a binary representation, do the same for your data. Merge the two and then arrange.

If I've understood correctly here is a subset example that shows what I mean:

library(tidyverse)

df <- tibble(col1 = sample(c(FALSE, TRUE), replace=TRUE, size=20), 
             col2 = sample(c(FALSE, TRUE), replace=TRUE, size=20),
             col3 = sample(c(FALSE, TRUE), replace=TRUE, size=20),
             col4 = sample(c(1,2), replace=TRUE, size=20),
             col5 = sample(c("A", "B", "C"), replace=TRUE, size=20))
  

test_df <- tibble(tests = c("FFF1", "FFF2", "TFF1", "TFF2", "TTF1", "TTF2", "TFT1", "TFT2"))


test_df <- test_df %>% 
  rowwise() %>% 
  mutate(pos = 0) %>% 
  mutate(pos = ifelse(substring(tests, 1, 1) == "T", pos + 1, pos)) %>% 
  mutate(pos = ifelse(substring(tests, 2, 2) == "T", pos + 2, pos)) %>% 
  mutate(pos = ifelse(substring(tests, 3, 3) == "T", pos + 4, pos)) %>% 
  mutate(pos = ifelse(substring(tests, 4, 4) == 2, pos + 8, pos))
  
df <- df %>% 
  rowwise() %>% 
  mutate(pos = 0) %>% 
  mutate(pos = ifelse(col1 == TRUE, pos + 1, pos)) %>% 
  mutate(pos = ifelse(col2 == TRUE, pos + 2, pos)) %>% 
  mutate(pos = ifelse(col3 == TRUE, pos + 4, pos)) %>% 
  mutate(pos = ifelse(col4 == 2,    pos + 8, pos))

left_join(test_df, df, by = join_by(pos == pos)) %>% 
  arrange(pos) %>% 
  select(-pos, -tests) %>%
na.omit()

Essentially this builds the dataframe and the tests. Then adds a column pos for each test and then left_joins them before arranging.

Not very pretty, but should give what you're looking for?

  col1  col2  col3   col4 col5 
  <lgl> <lgl> <lgl> <dbl> <chr>
1 FALSE FALSE FALSE     1 B    
2 TRUE  FALSE FALSE     1 A    
3 TRUE  TRUE  FALSE     1 B    
4 TRUE  TRUE  FALSE     1 B    
5 TRUE  TRUE  FALSE     1 B    
6 TRUE  FALSE TRUE      1 B    
7 TRUE  TRUE  FALSE     2 B 
Tech Commodities
  • 1,884
  • 6
  • 13