I am trying to understand how the group_by function works in dplyr. I am using the airquality data set that comes with the datasets package.

I understand that if I do the following, it should arrange the records in increasing order of the Temp variable:

airquality_max1 <- airquality %>% arrange(Temp)

I see that this is the case in airquality_max1. I now want to arrange the records in increasing order of Temp, but grouped by Month. The end result should first have all the records for Month == 5 in increasing order of Temp, then all the records for Month == 6 in increasing order of Temp, and so on. So I use the following command:

airquality_max2 <- airquality %>% group_by(Month) %>% arrange(Temp)

However, what I find is that the results are still in increasing order of Temp only, not grouped by Month, i.e., airquality_max1 and airquality_max2 are equal.

I am not sure why the grouping by Month does not happen before the arrange function. Can anyone help me understand what I am doing wrong here?

More than solving the problem of sorting the data frame by columns, I want to understand the behavior of group_by, because I am trying to use this example to explain group_by to someone.

Satya
  • Maybe you also need to add the `Month` parameter in `arrange`: `airquality_max2 <- airquality %>% arrange(Month, Temp)` – Ronak Shah Sep 05 '17 at 02:08
  • sorting is not an aggregation, so there's no need to use `group_by`... – MichaelChirico Sep 05 '17 at 02:10
  • I was trying to use this as a pedagogical example to show the application of `group_by`, but was stumped by this behavior. – Satya Sep 05 '17 at 02:13
  • Possible duplicate of [How to sort a dataframe by column(s)?](https://stackoverflow.com/questions/1296646/how-to-sort-a-dataframe-by-columns) – Ronak Shah Sep 05 '17 at 02:15

1 Answer

arrange ignores grouping by default; see the breaking changes in dplyr 0.5.0. If you need to order by two columns, you can do:

airquality %>% arrange(Month, Temp)

For a grouped data frame, you can also set the .by_group argument to TRUE to sort by the grouping variable first:

airquality %>% group_by(Month) %>% arrange(Temp, .by_group = TRUE)
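
To check the behavior yourself, here is a minimal sketch, assuming a dplyr version recent enough to support the `.by_group` argument. The variable names (`by_temp`, `grouped`, `by_group`, `two_column`) are only illustrative. It compares the row order produced by each of the pipelines above.

```r
library(dplyr)

# Default behavior: arrange() ignores the grouping, so both of these
# order the rows by Temp alone.
by_temp <- airquality %>% arrange(Temp)
grouped <- airquality %>% group_by(Month) %>% arrange(Temp)

# With .by_group = TRUE the grouping variable is used as the first sort key,
# which should match sorting by Month and then Temp explicitly.
by_group   <- airquality %>% group_by(Month) %>% arrange(Temp, .by_group = TRUE)
two_column <- airquality %>% arrange(Month, Temp)

# Compare the row order through the Month and Temp columns.
same_as_ungrouped <- identical(grouped$Temp, by_temp$Temp) &&
  identical(grouped$Month, by_temp$Month)
same_as_two_column <- identical(by_group$Temp, two_column$Temp) &&
  identical(by_group$Month, two_column$Month)

# Month is only in non-decreasing order when the grouping is respected.
month_sorted_default  <- !is.unsorted(grouped$Month)
month_sorted_by_group <- !is.unsorted(by_group$Month)

print(c(same_as_ungrouped, same_as_two_column,
        month_sorted_default, month_sorted_by_group))
```

If the grouping is ignored as described, the first two checks should come out TRUE, and only the `.by_group = TRUE` result should have Month in sorted order.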
Psidom
  • Thanks for the quick answer. The link is helpful. This behavior is counter-intuitive, but as long as it doesn't change again, it's okay. – Satya Sep 05 '17 at 02:11
  • `airquality %>% group_by(Month) %>% arrange(Temp, .by_group = TRUE)` gives me an error. `Error in arrange_impl(.data, dots) : incorrect size (1), expecting : 153` any idea why? – Ronak Shah Sep 05 '17 at 02:17
  • You may think you need to sort by a group variable, but usually you don't. As long as the sorting algorithm is stable (which I believe it is), you can do either `group_by %>% arrange` or `arrange %>% group_by`. `group_by` by itself will sort the data frame by the group variable. So effectively you are still sorting the data frame by both the group variable and the sorting variable, even if you didn't explicitly tell it to. – Psidom Sep 05 '17 at 02:18
  • @RonakShah I am not sure. It seems to run fine on my machine. – Psidom Sep 05 '17 at 02:21
  • @RonakShah - I got the error the first time, ran it again, and it worked fine. – Satya Sep 05 '17 at 02:23