Sort 2 columns simultaneously based on the group name

Question

Dataset

> read.delim("df.tsv")
   col1 col2 group
1     3    2    aa
2     1    1    aa
3     4    1    aa
4     4    3    aa
5     5    3    ab
6     3    2    ab
7     4    1    ab
8     2    4    ab
9     4    2    ba
10    1    4    ba
11    3    1    ba
12    4    3    ba
13    4    2    bb
14    2    3    bb
15    3    1    bb
16    1    2    bb

I want to sort the columns col1 and col2 within each of the 4 groups, in the following way:

If the 1st character in the group name is "a", sort col1 in a descending manner, and ascending if it's "b"
If the 2nd character in the group name is "a", sort col2 in a descending manner, and ascending if it's "b"
Importantly, I would like that both columns are sorted simultaneously, i.e. if e.g. the group is "aa", the sorting for that group should look like this:

   col1 col2 group
1     4    3    aa
2     3    2    aa
3     4    1    aa
4     1    1    aa
...

This could be accomplished by e.g. a "one row at a time" approach, first col1 and then col2, alternating for each row.

Current code and output

library(dplyr)

read.delim("df.tsv") %>%
  group_by(group) %>%
  arrange(ifelse(substr(group, 1,1) == "a", desc(col1), col1), # if first character in group name is "a", sort col1 in a descending manner, and ascending if it's "b"
          ifelse(substr(group, 2,2) == "a", desc(col2), col2), # if second character in group name is also "a", sort also col2 in a descending manner, and ascending if it's "b"
          .by_group = TRUE)

    col1  col2 group
 1     4     3 aa   
 2     4     1 aa   
 3     3     2 aa   
 4     1     1 aa   
 5     5     3 ab   
 6     4     1 ab   
 7     3     2 ab   
 8     2     4 ab   
 9     1     4 ba   
10     3     1 ba   
11     4     3 ba   
12     4     2 ba   
13     1     2 bb   
14     2     3 bb   
15     3     1 bb   
16     4     2 bb

However, this does not fulfill the 3rd criterion, the "simultaneous sorting one row at a time".

Desired output

    col1  col2 group
 1     4     3 aa   
 2     3     2 aa   
 3     4     1 aa   
 4     1     1 aa   
 5     5     3 ab   
 6     4     1 ab   
 7     3     2 ab   
 8     2     4 ab   
 9     1     4 ba   
10     4     3 ba   
11     3     1 ba   
12     4     2 ba   
13     1     2 bb   
14     3     1 bb   
15     2     3 bb   
16     4     2 bb

EDIT

There are a couple of answers that actually do the proposed task, so I think a tie-breaker could be that the algorithm is flexible with respect to the number of columns to sort, e.g. 3:

col1    col2    col3    group
3   2   4   aaa
1   1   2   aaa
4   1   4   aaa
4   3   1   aaa
5   3   3   aab
3   2   2   aab
4   1   1   aab
2   4   1   aab
4   2   3   aba
1   4   3   aba
3   1   2   aba
4   3   3   aba
3   2   4   abb
1   1   2   abb
4   1   4   abb
4   3   1   abb
4   2   1   baa
2   3   2   baa
3   1   2   baa
1   2   1   baa
5   3   3   bab
3   2   2   bab
4   1   1   bab
2   4   1   bab
4   2   3   bba
1   4   3   bba
3   1   2   bba
4   3   3   bba
4   2   1   bbb
2   3   2   bbb
3   1   2   bbb
1   2   1   bbb

The output should be

col1    col2    col3    group
4   3   1   aaa
3   2   4   aaa
4   1   4   aaa
1   1   2   aaa
5   3   3   aab
2   4   1   aab
4   1   1   aab
3   2   2   aab
4   2   3   aba
3   1   2   aba
4   3   3   aba
1   4   3   aba
4   1   4   abb
1   1   2   abb
4   3   1   abb
3   2   4   abb
1   2   1   baa
2   3   2   baa
3   1   2   baa
4   2   1   baa
2   4   1   bab
5   3   3   bab
4   1   1   bab
3   2   2   bab
1   4   3   bba
3   1   2   bba
4   2   3   bba
4   3   3   bba
1   2   1   bbb
3   1   2   bbb
4   2   1   bbb
2   3   2   bbb

Currently the 2 suggested solutions do not work when 3 or more columns are included, they sort based on only 2 columns.

EDIT 2

If e.g. group=='aba', the first row of this group should be the one that includes the highest value in col1; the 2nd row the one that includes the (remaining) lowest value in col2; the 3rd row the one that includes the (remaining) highest value in col3, and the 4th row is the remaining row. However, this should be flexible to allow for more than 4 rows per group, in that case the 4th row should be the one that includes the (remaining) highest value in col1; the 5th row should be the one that includes the (remaining) lowest value in col2; etc.

More details

Example: For the 2nd row of the 'aba' group, in the case that there is a tie between 2 rows for the lowest (remaining) value in col2, e.g.

row-a 3 1 4 aba
row-b 2 1 4 aba

(notice that there is a 1 in col2 in both rows), ideally then the chosen 2nd row would be row-a, since the col1 has to be sorted in a descending manner ('a') in this group, and 3>2, and for col3 4==4 anyway.

If instead

row-a 3 1 4 aba
row-b 2 1 5 aba

let the priority go col3>col2>col1, since the cycle goes col1>col2>col3... so the 2nd row would be row-b, since 5>4.

So to generalize, if there were 5 columns and the group were 'aabaa', and there is a tie for choosing the 3rd row between 2 rows:

row-a 3 2 1 3 3 aabaa
row-b 5 4 1 4 2 aabaa

(col3 == 1 in both), then the one to select would be row-a since for col5 3>2. If instead

row-a--> 3 2 1 3 3
row-b--> 5 4 1 4 3

(col5==3 in both), then choose row-b since for col4 4>3.

I'm not sure I understand why col1 should be `4 3 4` and not `4 4 3`? — Maël, Jan 21 '22 at 13:36
because if col1 is 4 4 3 means that col2 is 3 1 2, and what I would like is that 1st row has the highest value in col1, then row2 has the remaining highest value in col2, then row3 the remaining highest value in col1 — Miguel, Jan 21 '22 at 13:40

score 4 · Answer 1 · edited Oct 15 '22 at 08:36

On second thought, I think I can just pass that option to you. You can specify any cycling method you want now.

alt_order <- function(..., type, cyc) {
  cols <- unname(list(...))
  stopifnot(
    # sanity checks; you may skip if you think they are unnecessary
    length(unique(lengths(cols))) == 1L,
    length(cols) == length(type),
    all(unlist(type) %in% c(1L, -1L))
  ) 
  cols <- mapply(`*`, cols, type, SIMPLIFY = FALSE)
  out <- integer(length(cols[[1L]]))
  this <- cols
  for (i in seq_along(out)) {
    out[[i]] <- do.call(order, this)[[1L]]
    cols <- lapply(cols, `is.na<-`, out[[i]])
    this <- cols[cyc(i)]
  }
  out
}

cyc should be a function that accepts a single integer as input and returns a vector of integers. For example, if you have 3 columns and you want to replicate the rev cycling behavior as I described in the comment below, you can do this

mycyc <- function(i) list(1:3, 3:1)[[(i - 1) %% 2L + 1L]]
df %>% group_by(group) %>% slice(alt_order(col1, col2, col3, type = ab2sign(group), cyc = mycyc))

Well, perhaps a not efficient but simple solution is to just sort the two columns continuously, swap the major column each time, and discharge the first element until no element is left to be sorted. Here is the function.

alt_order <- function(..., type) {
  cols <- unname(list(...))
  stopifnot(
    # sanity checks; you may skip if you think they are unnecessary
    length(unique(lengths(cols))) == 1L,
    length(cols) == length(type),
    all(unlist(type) %in% c(1L, -1L))
  ) 
  cols <- mapply(`*`, cols, type, SIMPLIFY = FALSE)
  out <- integer(length(cols[[1L]]))
  for (i in seq_along(out)) {
    out[[i]] <- do.call(order, cols)[[1L]]
    cols <- rev(lapply(cols, `is.na<-`, out[[i]]))
  }
  out
}

We assign values to NAs to discharge them since NAs will be sorted to the last in an ascending way. type should be either 1 or -1 and is used to streamline the order we would like to impose since the descending order of c(1,2,3) is the same as the ascending order of -1 * c(1,2,3). We also need a helper function as follows to transfer your groups into 1 and -1

ab2sign <- function(x) {
  out <- data.table::transpose(strsplit(x, "", fixed = TRUE))
  lapply(out, function(x) 2L * (x == "b") - 1L)
}

Now we can apply them

df %>% group_by(group) %>% slice(alt_order(col1, col2, type = ab2sign(group)))

Output

# A tibble: 16 x 3
# Groups:   group [4]
    col1  col2 group
   <int> <int> <chr>
 1     4     3 aa   
 2     3     2 aa   
 3     4     1 aa   
 4     1     1 aa   
 5     5     3 ab   
 6     4     1 ab   
 7     3     2 ab   
 8     2     4 ab   
 9     1     4 ba   
10     4     3 ba   
11     3     1 ba   
12     4     2 ba   
13     1     2 bb   
14     3     1 bb   
15     2     3 bb   
16     4     2 bb

Perhaps more efficient (perhaps vectorized) solutions are available.

Thanks, this does the proposed task, but please check the edit — Miguel, Jan 24 '22 at 23:57
I am not sure about how to code. We only know that for the first round you would like to sort primarily based on col1, then col2, and col3 (123 basically), but how about later rounds? Keep the original order, move the used one to the last, cycle if needed, `123 -> 213 -> 312 -> 123 -> 213 -> ...`? Or move forward next unused, cycle if needed, no restart `123 -> 213 -> 321 -> 132 -> 213 -> 321 -> ...`? Or with restart `123 -> 213 -> 321 -> 123 -> 213 -> 321 -> ...`? Too many possibilities. It is not immediately obvious to me how you want to do this. May you please clarify? @Miguel — ekoam, Jan 25 '22 at 05:55
My current approach follows an analogy from the two-column case. Basically, `123 -> 321 -> 123 -> 321 -> 123 -> ...` @Miguel — ekoam, Jan 25 '22 at 06:06
Thanks, please see edit2; which should be the cycle to do what's described in there? — Miguel, Jan 25 '22 at 10:05
Still confused. Consider all four new examples you posted. For the first one, we start by comparing col2 > 1 vs 1 > undetermined > compare col2 > 4 vs 4 > undetermined > finished all cols > cycle to the col1 > 3 vs 2 > determined (row-a to be moved ahead of row-b since group 'a'). I can use this logic for the second new example without a problem. However, for the third and fourth examples, why do we skip col4? I thought it should be: start by comparing col3 > 1 vs 1 > undetermined > compare col4 > 3 vs 4 > determined (row-b to be moved ahead of row-a since group 'a')? @Miguel — ekoam, Jan 25 '22 at 12:15

ThomasIsCoding · Accepted Answer · 2022-01-25T12:19:46.933

2

Update

Below is an option may work in general cases, i.e., more than 2 columns:

f <- function(.) {
  col <- .[-length(.)] * (2 * (t(list2DF(strsplit(.$group, ""))) == "b") - 1)
  r <- data.frame()
  while (nrow(.)) {
    p <- do.call(order, col[(seq_along(col) + nrow(r) - 1) %% length(col) + 1])[1]
    r <- rbind(r, .[p, ])
    col <- col[-p, ]
    . <- .[-p, ]
  }
  r
}

df %>%
  group_by(group) %>%
  do(f(.)) %>%
  ungroup()

which gives

    col1  col2  col3 group
   <int> <int> <int> <chr>
 1     4     3     1 aaa
 2     3     2     4 aaa
 3     4     1     4 aaa
 4     1     1     2 aaa
 5     5     3     3 aab
 6     2     4     1 aab
 7     4     1     1 aab
 8     3     2     2 aab
 9     4     2     3 aba
10     3     1     2 aba
# ... with 22 more rows

Here is an option using dynamic programming (but maybe not that efficient)

f <- function(.) {
  col <- with(., data.frame(col1, col2) * (2 * (t(list2DF(strsplit(.$group, ""))) == "b") - 1))
  r <- data.frame()
  while (nrow(.)) {
    p <- do.call(order, ifelse(nrow(r) %% 2, rev, I)(col))[1]
    r <- rbind(r, .[p, ])
    col <- col[-p,]
    . <- .[-p, ]
  }
  r
}

df %>%
  group_by(group) %>%
  do(f(.)) %>%
  ungroup()

which gives

# A tibble: 16 x 3
    col1  col2 group
   <int> <int> <chr>
 1     4     3 aa
 2     3     2 aa
 3     4     1 aa
 4     1     1 aa
 5     5     3 ab
 6     4     1 ab
 7     3     2 ab
 8     2     4 ab
 9     1     4 ba
10     4     3 ba
11     3     1 ba
12     4     2 ba
13     1     2 bb
14     3     1 bb
15     2     3 bb
16     4     2 bb

edited Jan 25 '22 at 12:19

answered Jan 24 '22 at 21:44

ThomasIsCoding

96,636
9
24
81

Thanks, this does the proposed task, but please check the edit – Miguel Jan 24 '22 at 23:58
@Miguel Then what's is the logic for the case with 3 columns? – ThomasIsCoding Jan 25 '22 at 09:41
If e.g. group=='aba', the first row of this group should be the one that includes the highest value in col1; the 2nd row the one that includes the (remaining) lowest value in col2; the 3rd row the one that includes the (remaining) highest value in col3, and the 4th row is the remaining row. However, this should be flexible to allow for more than 4 rows per group, in that case the 4th row should be the one that includes the (remaining) highest value in col1; the 5th row should be the one that includes the (remaining) lowest value in col2; etc. – Miguel Jan 25 '22 at 10:02
@Miguel Thanks. Then, for the 2nd row, we find the lowest value in col2 among remaining values, but what is the priority for col1 and col3? Should the priorities rank like col2 > col1 > col3? – ThomasIsCoding Jan 25 '22 at 10:08
For the 2nd row, the top priority is col2. In the case that there is a tie between 2 rows in col2 for the lowest (remaining) value, e.g. row-a --> 3 1 4 ; row-b --> 2 1 4 (notice that there is a 1 in col2 in both rows), ideally then the chosen row would be row-a, since the col1 has to be sorted in a descending manner ('a') in this group, and 3>2, and for col3 4==4 anyway. However this is not crucial. – Miguel Jan 25 '22 at 10:14
@Miguel what if row-a --> 3 1 4 ; row-b --> 2 1 5? – ThomasIsCoding Jan 25 '22 at 10:20
Again this is not crucial, but in that case, let the priority go col3>col2>col1, since the cycle goes col1>col2>col3... so the row of choice would be row-b since 5>4. So to generalize, if there were 5 columns and the group is aabaa, and there is a tie for the 3rd row between row-a--> 3 2 1 3 3 and row-b--> 5 4 1 4 2 (col3 ==1 in both), then the one to select would be row-a since for col5 3>2. If row-a--> 3 2 1 3 3 and row-b--> 5 4 1 4 3 (col5==3 in both), then choose row-b since for col4 4>3. – Miguel Jan 25 '22 at 10:31
@Miguel See my update – ThomasIsCoding Jan 25 '22 at 12:20

score -2 · Answer 3 · edited Jan 23 '22 at 00:00

-2

I can tell you the answer, but I cannot write complete r code for it, since I don't know r, I hope some one may edit my code for a complete answer.

suppose both sorts are ascending (you can generalize it to your case)

idx1=order(col1)
idx2=order(col2[idx1])

return col1[idx1[idx2]], col2[idx1[idx2]]

edited Jan 23 '22 at 00:00

Ihor Konovalenko

1,298
2
16
21

answered Jan 22 '22 at 10:59

jumper

5
1

Thanks, but unfortunately it's not obvious to me how your code should be applied to R to fulfill the task – Miguel Jan 24 '22 at 13:35
2

I tried this code in **R** but it's not working for me. – Erfan Ghasemi Jan 24 '22 at 14:17