Sorting one variable in a data frame by id

Question

I have a data frame with lot of company information separated by an id variable. I want to sort one of the variables and repeat it for every id. Let's take this example,

df <- structure(list(id = c(110, 110, 110, 90, 90, 90, 90, 252, 252
), var1 = c(26, 21, 54, 10, 18, 9, 16, 54, 39), var2 = c(234, 
12, 43, 32, 21, 19, 16, 34, 44)), .Names = c("id", "var1", "var2"
), row.names = c(NA, -9L), class = "data.frame")

Which looks like this

df
   id var1 var2
1 110   26  234
2 110   21   12
3 110   54   43
4  90   10   32
5  90   18   21
6  90    9   19
7  90   16   16
8 252   54   34
9 252   39   44

Now, I want to sort the data frame according to var1 by the vector id. Easiest solution I can think of is using apply function like this,

> apply(df, 2, sort)
       id var1 var2
 [1,]  90    9   12
 [2,]  90   10   16
 [3,]  90   16   19
 [4,]  90   18   21
 [5,] 110   21   32
 [6,] 110   26   34
 [7,] 110   39   43
 [8,] 252   54   44
 [9,] 252   54  234

However, this is not the output I am seeking. The correct output should be,

   id var1 var2
1 110   21   12
2 110   26  234
3 110   54   43
4  90    9   19
5  90   10   32
6  90   16   16
7  90   18   21
8 252   39   44
9 252   54   34

Group by id and sort by var1 column and keep original id column order.

Any idea how to sort like this?

Possible duplicate of [How to sort a dataframe by multiple column(s)?](https://stackoverflow.com/questions/1296646/how-to-sort-a-dataframe-by-multiple-columns) — jay.sf, Mar 20 '19 at 09:00
I don't believe this is a dupe, at least not that one. The OP is asking for a way to sort by groups of `id`. — Rui Barradas, Mar 20 '19 at 09:02
Why is `id - 110` first? what is the logic behind the sorting? — Sotos, Mar 20 '19 at 09:04
@Sotos Column `id` is not to be sorted, only `var1` is, within each group of `id`. — Rui Barradas, Mar 20 '19 at 09:06

thothal · Accepted Answer · 2019-03-20T09:58:57.280

9

Note. As mentioned by Moody_Mudskipper, there is no need to use tidyverse and can also be done easily with base R:

df[order(ordered(df$id, unique(df$id)), df$var1), ]

A one-liner tidyverse solution w/o any temp vars:

library(tidyverse)
df %>% arrange(ordered(id, unique(id)), var1)
#    id var1 var2
# 1 110   26  234
# 2 110   21   12
# 3 110   54   43
# 4  90   10   32
# 5  90   18   21
# 6  90    9   19
# 7  90   16   16
# 8 252   54   34
# 9 252   39   44

Explanation of why apply(df, 2, sort) does not work

What you were trying to do is to sort each column independently. apply runs over the specified dimension (2 in this case which corresponds to columns) and applies the function (sort in this case).

apply tries to further simplify the results, in this case to a matrix. So you are getting back a matrix (not a data.frame) where each column is sorted independently. For example this row from the apply call:

# [1,]  90    9   12

does not even exist in the original data.frame.

edited Mar 20 '19 at 09:58

answered Mar 20 '19 at 09:30

thothal

16,690
3
36
71

2

In base `R` your solution would become `df[order(factor(df$id,unique(df$id)), df$var1),]` (I use `factor` instead of `ordered` as `ordered` is not necessary, longer to type and less known). – moodymudskipper Mar 20 '19 at 09:40
1

True, `ordered` is longer than `factor` but IMHO semantically clearer. A `factor` has only an implicit `order` while `ordered` has an explicit one. `R` is not coherent here I would say: `min(factor(1:3, 3:1))` does not work for obvious reasons, but `sort( factor(1:3, 3:1))` does, because the `factor` is silently treated as a numeric. – thothal Mar 20 '19 at 09:45
1

In my view `ordered` is something I would use for models, to aggregate values (`range`,`min`, `max`, `median`, `quantile`, ...), or to arrange ggplot categories, so I see its use here as extra complexity, but I see your point (relevant: https://stackoverflow.com/questions/23396591/factors-ordered-vs-levels/45196908#45196908 ). But I'm nitpicking (and upvoted from the start). – moodymudskipper Mar 20 '19 at 10:42
Good read +1. I think that `sort(factor(.))` works because of legacy reasons rather than intentional. I can hardly imagine how you would define a sort algorithm w/o comparison ops. At some point it was too late to fix that and now it is valid R code. My take is that I want to be explicit rather than implicit to make my code more understandable. That's why tidyverse` wins for me over `data.table`, unless performance is a deakbreaker. – thothal Mar 20 '19 at 12:26

score 6 · Answer 2 · answered Mar 20 '19 at 09:21

Another base R option using order and match

df[with(df, order(match(id, unique(id)), var1, var2)), ]
#   id var1 var2
#2 110   21   12
#1 110   26  234
#3 110   54   43
#6  90    9   19
#4  90   10   32
#7  90   16   16
#5  90   18   21
#9 252   39   44
#8 252   54   34

score 3 · Answer 3 · answered Mar 20 '19 at 09:20

We can convert the id to factor in order to split while preserving the original order. We can then loop over the list and order, and rbind again, i.e.

df$id <- factor(df$id, levels = unique(df$id))
do.call(rbind, lapply(split(df, df$id), function(i)i[order(i$var1),]))

#       id var1 var2
#110.2 110   21   12
#110.1 110   26  234
#110.3 110   54   43
#90.6   90    9   19
#90.4   90   10   32
#90.7   90   16   16
#90.5   90   18   21
#252.9 252   39   44
#252.8 252   54   34

NOTE: You can reset the rownames by rownames(new_df) <- NULL

moodymudskipper · Answer 4 · 2019-03-20T11:12:38.727

3

In base R we could use split<- :

split(df,df$id) <- lapply(split(df,df$id), function(x) x[order(x$var1),] )

or as @Markus suggests :

split(df, df$id) <- by(df, df$id, function(x) x[order(x$var1),])

output in either case :

df
#    id var1 var2
# 1 110   21   12
# 2 110   26  234
# 3 110   54   43
# 4  90    9   19
# 5  90   10   32
# 6  90   16   16
# 7  90   18   21
# 8 252   39   44
# 9 252   54   34

edited Mar 20 '19 at 11:12

answered Mar 20 '19 at 09:31

moodymudskipper

46,417
11
121
167

Rui Barradas · Answer 5 · 2019-03-20T09:16:41.053

With the following tidyverse pipe, the question's output is reproduced.

library(tidyverse)

df %>%
  mutate(tmp = cumsum(c(0, diff(id) != 0))) %>%
  group_by(id) %>%
  arrange(tmp, var1) %>%
  select(-tmp)
## A tibble: 9 x 3
## Groups:   id [3]
#     id  var1  var2
#  <dbl> <dbl> <dbl>
#1   110    21    12
#2   110    26   234
#3   110    54    43
#4    90     9    19
#5    90    10    32
#6    90    16    16
#7    90    18    21
#8   252    39    44
#9   252    54    34

Sorting one variable in a data frame by id

5 Answers5