split-apply-combine: sorting and grouping multiple variables in R

Question

I'm attempting to sort a dataframe based on continuous variables between two categories of data. Specifically, I want to sort (descending) on the continuous variable, but preserving variables of a similar type together. Here's an example:

    pets <- data.frame(animal = c("dog", "dog", "dog", "cat", "cat", "fish", "fish", "fish"),
       breed = c("retriever", "husky", "husky", "grey", "white", "guppy", "betta", "betta"),
       count = c(4, 3, 7, 8, 9, 2, 12, 1))

Right now, the data frame is unsorted. I want to sort it so that the breeds with the highest mean count appear first, but all breeds (and animals) stay grouped together. If I order the frame based on breed, I lose the correct order of the count, and vice versa. Even when I try both like so:

    pets[with(pets, order(breed, -count)), ]

the output is not sorted properly. I went through some split-apply-combine tutorials, but I could only find ones that attempted to keep one group of data together, not two as in my data.

Right now, this is the best I've got:

    split_pets <- split(pets, pets$animal)
    unlist(lapply(split_pets, function(x) sort(with(x, tapply(count, breed, mean)), decreasing = TRUE)))

Which returns

cat.white      cat.grey     dog.husky dog.retriever    fish.betta    fish.guppy 
      9.0           8.0           5.0           4.0           6.5           2.0

Sure, I've got the correct order they go in. But I don't actually even care about the means, I just need to sort the original data frame based on this. Next up is going down a rabbit hole of splitting this again based on breed, but then I'd be sorting based on a column of a data frame of a list of lists. That sounds way too complicated. I've also tried ordering on count and then piping that to group_by() from dplyr, but that didn't get me any farther than I'm at now.

Thanks for any help!

Kevin Arseneau · Accepted Answer · 2017-12-20T00:18:07.593

1

You can sort the groups first and then join it back to the original set in your intended order.

pets <- data.frame(
  animal = c("dog", "dog", "dog", "cat", "cat", "fish", "fish", "fish"),
  breed = c("retriever", "husky", "husky", "grey", "white", "guppy", "betta", "betta"),
  count = c(4, 3, 7, 8, 9, 2, 12, 1),
  stringsAsFactors = FALSE
)

library(dplyr)

pets %>%
  group_by(animal, breed) %>%
  summarise(avg = mean(count)) %>%
  right_join(pets, by = c("animal", "breed")) %>%
  arrange(animal, desc(avg), desc(count)) %>%
  select(-avg) %>%
  ungroup

# # A tibble: 8 x 3
#   animal     breed count
#    <chr>     <chr> <dbl>
# 1    cat     white     9
# 2    cat      grey     8
# 3    dog     husky     7
# 4    dog     husky     3
# 5    dog retriever     4
# 6   fish     betta    12
# 7   fish     betta     1
# 8   fish     guppy     2

edited Dec 20 '17 at 00:18

answered Dec 19 '17 at 23:58

Kevin Arseneau

6,186
1
21
40

This solution is much more elegant than the solution path I was on. However, running this exact code I get the error: `Error: 'by' can't contain join column 'animal', 'breed' which is missing from LHS'` Running line-by-line, this appears to come up at right_join() – AcademicDialysis Dec 20 '17 at 17:59
Solved the bug raised in my last comment. The problem was that `plyr` was loaded before `dplyr` in my namespace, resulting in a `summarise` result with only one mean instead of one for each group that came out of `group_by`. Stumbled upon that here: https://stackoverflow.com/questions/26923862/why-are-my-dplyr-group-by-summarize-not-working-properly-name-collision-with – AcademicDialysis Dec 20 '17 at 18:14

split-apply-combine: sorting and grouping multiple variables in R

1 Answers1