Reordering a factor based on a summary statistic of a subset of the data

Question

I'm trying to reorder a factor with forcats::fct_reorder(), using only a subset of my data frame (defined by another factor) to determine the order.

Consider the following data frame df:

set.seed(12)
df <- data.frame(fct1 = as.factor(rep(c("A", "B", "C"), each = 200)),
                 fct2 = as.factor(rep(c("j", "k"), each = 100)),  # recycled to 600 rows
                 val = c(rnorm(100, 2),   # A - j
                         rnorm(100, 1),   # A - k
                         rnorm(100, 1),   # B - j
                         rnorm(100, 6),   # B - k
                         rnorm(100, 8),   # C - j
                         rnorm(100, 4)))  # C - k

I want to plot facetted group densities using the ggridges package. For example:

library(ggplot2)
library(ggridges)

ggplot(data = df, aes(y = fct2, x = val)) +
    stat_density_ridges(geom = "density_ridges_gradient",
                        calc_ecdf = TRUE,
                        quantile_fun = median,
                        quantile_lines = TRUE) +
    facet_wrap(~fct1, ncol = 1)

I would now like to order fct1 by the median (default in fct_reorder()) of the values of the upper density in each facet, i.e. where fct2 == "k". The goal in this example would therefore be that the facets appear in the order B - C - A. This seems very similar to this question here, with the difference that I do not want to summarize the data first because I need the raw data to plot the densities.

I've tried to adapt the code in the answer of the linked question:

df <- df %>% mutate(fct1 = forcats::fct_reorder(fct1, filter(., fct2 == 'k') %>% pull(val)))

But it returns the following error:

Error in forcats::fct_reorder(fct1, filter(., fct2 == "k") %>% pull(val)) :
  length(f) == length(.x) is not TRUE

It's obvious that they are not the same length, but I don't quite get why this error is necessary. My guess is that it's generally not guaranteed that all levels of fct1 are present in the subset, which would certainly be problematic. Yet, this isn't the case in my example. Is there a way to work around this error or am I doing something wrong more generally?

I'm aware that I can work around this with a couple of extra lines of code, e.g. create a helper variable from the subsetted data, reorder it, and then apply the resulting level order to the factor in the original data set. I would still like a prettier solution, because I regularly face this very task.
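For reference, the workaround described in the previous paragraph might look roughly like the sketch below. It is only an illustration: `toy`, `k_only` and `lvl_order` are made-up names, and the toy data is a small deterministic stand-in for df (not the data from the question) so that the result is easy to check by hand.

```r
library(forcats)

# Toy data (hypothetical, not the question's df): three levels of fct1,
# each with two "j" rows followed by two "k" rows.
toy <- data.frame(
  fct1 = factor(rep(c("A", "B", "C"), each = 4)),
  fct2 = factor(rep(c("j", "k"), each = 2, times = 3)),
  val  = c(2, 2, 1, 1,  1, 1, 6, 6,  8, 8, 4, 4)
)

# 1. Helper: keep only the rows that should drive the ordering.
k_only <- toy[toy$fct2 == "k", ]

# 2. Reorder the helper's factor by the median of val, largest first,
#    and keep just the resulting level order.
lvl_order <- levels(fct_reorder(k_only$fct1, k_only$val,
                                .fun = median, .desc = TRUE))

# 3. Apply that level order to the full data; the raw rows stay untouched.
toy$fct1 <- factor(toy$fct1, levels = lvl_order)

levels(toy$fct1)
```

With the "k" medians being 1 (A), 6 (B) and 4 (C), the levels end up as B, C, A. Lengths match inside fct_reorder() here because both arguments come from the same subset, which is exactly what the failing mutate() call does not guarantee.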

score 1 · Answer 1 · answered Jun 10 '20 at 12:29

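You can do this with a little helper function. For row i, it returns the negated median of val over the rows where fct2 == "k" and fct1 matches that row's level (negated so that larger medians sort first):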

f <- function(i) -median(df$val[df$fct2 == "k" & df$fct1 == df$fct1[i]])

Which allows you to reorder like this:

df$fct1 <- forcats::fct_reorder(df$fct1, sapply(seq(nrow(df)), f))

Which gives you this plot:

ggplot(data = df, aes(y = fct2, x = val)) +
    stat_density_ridges(geom = "density_ridges_gradient",
                        calc_ecdf = TRUE,
                        quantile_fun = median,
                        quantile_lines = TRUE) +
    facet_wrap(~fct1, ncol = 1)
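If this task comes up regularly, the same idea could be wrapped in a small reusable function. The sketch below is one possible way to do it in base R, not something forcats provides: `reorder_by_subset()` and its arguments are hypothetical names. It computes the summary once per level instead of once per row, and puts levels that have no rows in the subset last rather than dropping them.

```r
# Hypothetical helper (not part of forcats): order the levels of `f` by a
# summary of `x`, computed only on the rows where `keep` is TRUE.
reorder_by_subset <- function(f, x, keep, fun = median, decreasing = FALSE) {
  f <- as.factor(f)
  idx <- which(keep)                      # which() also discards NA in `keep`
  # One summary value per level; NA for levels with no kept rows.
  stat <- as.vector(tapply(x[idx], f[idx], fun))
  # Levels without kept rows (NA) are placed last, so no level is lost.
  ord <- order(stat, decreasing = decreasing, na.last = TRUE)
  factor(f, levels = levels(f)[ord])
}

# Toy data (hypothetical stand-in for df): "k" medians are A = 1, B = 6, C = 4.
toy <- data.frame(
  fct1 = factor(rep(c("A", "B", "C"), each = 4)),
  fct2 = factor(rep(c("j", "k"), each = 2, times = 3)),
  val  = c(2, 2, 1, 1,  1, 1, 6, 6,  8, 8, 4, 4)
)

toy$fct1 <- reorder_by_subset(toy$fct1, toy$val, toy$fct2 == "k",
                              decreasing = TRUE)
levels(toy$fct1)
```

On the toy data the levels come out as B, C, A. On the question's df, the analogous call would be reorder_by_subset(df$fct1, df$val, df$fct2 == "k", decreasing = TRUE).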

answered Jun 10 '20 at 12:29

Allan Cameron

Thanks. Definitely works, but I would argue that the solution you provided is in the realm of things you come up with on the fly when you just want it to work. I feel like there should be a more elegant way of solving such a common problem. Maybe I'm wrong :). – PRZ Jun 10 '20 at 12:50
@PinotTiger maybe there is a better way, but I don't think it's all that common a problem. When you say it out loud - "I want a mechanism whereby I can reorder a variable's factor levels based on the median of a different variable, but only when a third variable has a particular factor level", it's easy to see why a package writer may have overlooked it. I don't think I've come across this specific requirement before, and I have done a _lot_ of data wrangling. If you can do the data manipulation in fewer characters than it takes to describe (as here), you're normally doing well. – Allan Cameron Jun 10 '20 at 13:03
Agreed, I get that this is already a pretty short solution to the problem. Yet, the reason why I created the question in the first place is because I wanted to move away from the longer solutions I've come up with in the past (see last paragraph of my question). I'd still argue that it's quite common when it comes to plotting data. `ggridges` is just one example, but it occurs all the time with dodged bar plots, box plots, etc. And off-topic side note: Isn't it the beauty of code that it generally requires fewer characters than explaining what you're doing? :) – PRZ Jun 10 '20 at 13:19
