How to sort a character column by frequency of groups in R?

Question

I asked a question here, but upon review, it wasn't very clear and was asking the wrong questions. Im asking another question now to try and make more clear my issue.

In my data frame I have a column comprised of characters and NAs. Im trying to sort this column into groups based on their frequency, whilst preserving the structure of the groups. For example, if my data looks like this:

library(dplyr)

dfTest <- 
structure(list(var = c("x1", NA, NA, "x4", NA, "x4", NA, NA, 
"x5", NA, NA, "x1", NA, NA, "x5", NA, NA, "x4", NA, "x4", NA, 
NA, "x5", NA, NA, "x2", NA, NA, "x1", NA, "x2", NA, NA, "x5", 
NA, NA), nam = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 
5, 5, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5
), itr = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -36L))

> dfTest
# A tibble: 36 × 3
   var     nam   itr
   <chr> <dbl> <dbl>
 1 x1        1     1
 2 NA        1     1
 3 NA        1     1
 4 x4        2     1
 5 NA        2     1
 6 x4        2     1
 7 NA        2     1
 8 NA        2     1
 9 x5        3     1
10 NA        3     1
11 NA        3     1
12 x1        4     1
13 NA        4     1
14 NA        4     1
15 x5        5     1
16 NA        5     1
17 NA        5     1
18 x4        1     2
19 NA        1     2
20 x4        1     2
21 NA        1     2
22 NA        1     2
23 x5        2     2
24 NA        2     2
25 NA        2     2
26 x2        3     2
27 NA        3     2
28 NA        3     2
29 x1        4     2
30 NA        4     2
31 x2        4     2
32 NA        4     2
33 NA        4     2
34 x5        5     2
35 NA        5     2
36 NA        5     2

What I am trying to do is sort the var column by frequency for each itr, whilst preserving the the grouping seen in the nam column. So, for example, when itr = 1, and when we group var and nam, we can see that the structure x1, NA, NA appears twice, as does the structure x5, NA, NA. When itr = 2 the group x5, NA, NA appears twice, and all the rest of the groups appear only once.

My desired output would look something like the data frame below. Where, we can see that each var structure, has been sorted according to their frequency for each itr.

# A tibble: 36 × 3
   var     nam   itr
   <chr> <dbl> <dbl>
 1 x1        1     1
 2 NA        1     1
 3 NA        1     1
 4 x1        4     1
 5 NA        4     1
 6 NA        4     1
 7 x5        3     1
 8 NA        3     1
 9 NA        3     1
10 x5        5     1
11 NA        5     1
12 NA        5     1
13 x4        2     1
14 NA        2     1
15 x4        2     1
16 NA        2     1
17 NA        2     1
18 x5        2     2
19 NA        2     2
20 NA        2     2
21 x5        5     2
22 NA        5     2
23 NA        5     2
24 x4        1     2
25 NA        1     2
26 x4        1     2
27 NA        1     2
28 NA        1     2
29 x2        3     2
30 NA        3     2
31 NA        3     2
32 x1        4     2
33 NA        4     2
34 x2        4     2
35 NA        4     2
36 NA        4     2

I hope that makes sense.

Any suggestions as to how I could achieve this type of sorting whilst preserving the structure of each var by nam?

I was advised by another experienced user to rollback my edits from my previous question and ask a new separate question. — Electrino, Jan 13 '22 at 19:44
Hmm, I see that now. I guess I will leave it up to you. Another way around the "no one looks at old questions" problem is to start a [bounty](https://stackoverflow.com/help/privileges/set-bounties). That should easily get you an answer. — Ian Campbell, Jan 13 '22 at 19:45
I'm not sure if this helps, but note that `var` can be grouped according to `cumsum(!is.na(var))`. — Mikael Jagan, Jan 13 '22 at 20:26
@MikaelJagan that's not quite right, the `var` groups are already well marked by the `nam` column, and some of them do have multiple non-NA values, e.g., rows 4-8 are one "structure", and have values `x4, NA, x4, NA, NA` — Gregor Thomas, Jan 13 '22 at 20:42
@GregorThomas I see - my mistake, then. My understanding was that rows 4-8 contained 2 segments, not 1. — Mikael Jagan, Jan 13 '22 at 20:46

Gregor Thomas · Accepted Answer · 2022-01-13T20:54:14.210

Okay, I think this works. I've left in my helper columns for transparency, but you can remove them from the final select.

Basically, we collapse the data to one row per itr:nam group, keeping var around both as a concatenated string that we can group on and as a list column we can unnest to restore the original structures. We then count the frequency for each var_string, and sort the rows by this frequency, using the minimum nam value within the var_string group as a tiebreaker. This is the bulk of the work. We then unnest the data to get back to the original structure in the now-sorted order.

dfTest %>% 
  group_by(itr, nam) %>%
  summarize(var_string = paste(var, collapse = ";"), var= list(var))  %>%
  group_by(itr, var_string) %>%
  add_tally %>%
  mutate(min_nam = min(nam)) %>%
  arrange(itr, desc(n), min_nam, nam) %>%
  ungroup() %>%
  unnest(var) %>%
  select(var, nam, itr, everything()) %>%
  print.data.frame
#     var nam itr     var_string n min_nam
# 1    x1   1   1       x1;NA;NA 2       1
# 2  <NA>   1   1       x1;NA;NA 2       1
# 3  <NA>   1   1       x1;NA;NA 2       1
# 4    x1   4   1       x1;NA;NA 2       1
# 5  <NA>   4   1       x1;NA;NA 2       1
# 6  <NA>   4   1       x1;NA;NA 2       1
# 7    x5   3   1       x5;NA;NA 2       3
# 8  <NA>   3   1       x5;NA;NA 2       3
# 9  <NA>   3   1       x5;NA;NA 2       3
# 10   x5   5   1       x5;NA;NA 2       3
# 11 <NA>   5   1       x5;NA;NA 2       3
# 12 <NA>   5   1       x5;NA;NA 2       3
# 13   x4   2   1 x4;NA;x4;NA;NA 1       2
# 14 <NA>   2   1 x4;NA;x4;NA;NA 1       2
# 15   x4   2   1 x4;NA;x4;NA;NA 1       2
# 16 <NA>   2   1 x4;NA;x4;NA;NA 1       2
# 17 <NA>   2   1 x4;NA;x4;NA;NA 1       2
# 18   x5   2   2       x5;NA;NA 2       2
# 19 <NA>   2   2       x5;NA;NA 2       2
# 20 <NA>   2   2       x5;NA;NA 2       2
# 21   x5   5   2       x5;NA;NA 2       2
# 22 <NA>   5   2       x5;NA;NA 2       2
# 23 <NA>   5   2       x5;NA;NA 2       2
# 24   x4   1   2 x4;NA;x4;NA;NA 1       1
# 25 <NA>   1   2 x4;NA;x4;NA;NA 1       1
# 26   x4   1   2 x4;NA;x4;NA;NA 1       1
# 27 <NA>   1   2 x4;NA;x4;NA;NA 1       1
# 28 <NA>   1   2 x4;NA;x4;NA;NA 1       1
# 29   x2   3   2       x2;NA;NA 1       3
# 30 <NA>   3   2       x2;NA;NA 1       3
# 31 <NA>   3   2       x2;NA;NA 1       3
# 32   x1   4   2 x1;NA;x2;NA;NA 1       4
# 33 <NA>   4   2 x1;NA;x2;NA;NA 1       4
# 34   x2   4   2 x1;NA;x2;NA;NA 1       4
# 35 <NA>   4   2 x1;NA;x2;NA;NA 1       4
# 36 <NA>   4   2 x1;NA;x2;NA;NA 1       4

That's what you said last time ;) ... but this time I think you're right. — Gregor Thomas, Jan 13 '22 at 20:55

How to sort a character column by frequency of groups in R?

1 Answers1