A sensible function to sort dataframes

Question

I am creating a function to sort data.frames (Why? Because of reasons). Some of the criteria:

Works on data.frames
Aimed at non-interactive use
Uses base R only
No dependency on any non-base package

The function looks like this now:

#' @title Sort a data.frame
#' @description Sort a data.frame based on one or more columns
#' @param x A data.frame class object
#' @param by A character vector of one or more column names in the data.frame. Defaults to NULL, which sorts by all columns.
#' @param decreasing A logical indicating the direction of sorting.
#' @return A data.frame.
#' 
sortdf <- function(x,by=NULL,decreasing=FALSE) {
  if(!is.data.frame(x)) stop("Input is not a data.frame.")

  if(is.null(by)) {
    ord <- do.call(order,x)
  } else {
    if(any(!by %in% colnames(x))) stop("One or more items in 'by' was not found.")
    if(length(by) == 1) ord <- order(x[ , by])
    if(length(by) > 1) ord <- do.call(order, x[ , by])
  }
  
  if(decreasing) ord <- rev(ord)
  return(x[ord, , drop=FALSE])
}

Examples

sortdf(iris)
sortdf(iris,"Petal.Length")
sortdf(iris,"Petal.Length",decreasing=TRUE)
sortdf(iris,c("Petal.Length","Sepal.Length"))

What works so far

Sort data.frame by one or more columns
Adjust overall direction of sort

But I need one more feature: the ability to set the sorting direction for each column separately, by passing a vector of directions, one for each column specified in by. For example:

sortdf(iris,by=c("Sepal.Width","Petal.Width"),dir=c("up","down"))

Any ideas/suggestions on how to implement this?

Update

Benchmark of the answers below (judging by the labels, sortdf() is Maël's function and sortdf1() is Mikko's):

library(microbenchmark)
library(ggplot2)

# Labels: 1u = one column ascending, 1d = one column descending,
# 2d = two columns descending, 1u1d = first column ascending, second descending
m <- microbenchmark::microbenchmark(
  "base 1u"=iris[order(iris$Petal.Length),],
  "Maël 1u"=sortdf(iris,"Petal.Length"),
  "Mikko 1u"=sortdf1(iris,"Petal.Length"),
  "arrange 1u"=dplyr::arrange(iris,Petal.Length),
  "base 1d"=iris[order(iris$Petal.Length,decreasing=TRUE),],
  "Maël 1d"=sortdf(iris,"Petal.Length",dir="down"),
  "Mikko 1d"=sortdf1(iris,"Petal.Length",decreasing=TRUE),
  "arrange 1d"=dplyr::arrange(iris,-Petal.Length),
  "base 2d"=iris[order(iris$Petal.Length,iris$Sepal.Length,decreasing=TRUE),],
  "Maël 2d"=sortdf(iris,c("Petal.Length","Sepal.Length"),dir=c("down","down")),
  "Mikko 2d"=sortdf1(iris,c("Petal.Length","Sepal.Length"),decreasing=TRUE),
  "arrange 2d"=dplyr::arrange(iris,-Petal.Length,-Sepal.Length),
  # Negate the second key for a descending sort; rev() only reverses positions
  "base 1u1d"=iris[order(iris$Petal.Length,-iris$Sepal.Length),],
  "Maël 1u1d"=sortdf(iris,c("Petal.Length","Sepal.Length"),dir=c("up","down")),
  "Mikko 1u1d"=sortdf1(iris,c("Petal.Length","Sepal.Length"),decreasing=c(FALSE,TRUE)),
  "arrange 1u1d"=dplyr::arrange(iris,Petal.Length,-Sepal.Length),
  times=1000
)
autoplot(m)+theme_bw()

R 4.1.0
dplyr 1.0.7
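
Timings are only comparable when the benchmarked expressions return the same rows, so it may be worth checking the orderings first. Below is a minimal sketch on a made-up data frame (toy and its columns are assumptions, not part of the question). It compares three ways of writing "first column up, second column down" in base R: negating the numeric key, passing a vector to decreasing together with method = "radix" (documented in ?order), and rev(), which only reverses element positions and is not a descending sort key.

```r
# Toy data, invented for illustration; every (a, b) pair is unique
toy <- data.frame(
  a = c(1, 1, 2, 2, 3),
  b = c(10, 40, 20, 30, 50)
)

# a ascending, b descending, by negating the numeric key
ord_neg <- order(toy$a, -toy$b)

# Same intent, using a per-key `decreasing` vector (needs method = "radix")
ord_radix <- order(toy$a, toy$b, decreasing = c(FALSE, TRUE), method = "radix")

# rev() reverses the positions of b, which yields a different key altogether
ord_rev <- order(toy$a, rev(toy$b))

identical(ord_neg, ord_radix)  # the two descending forms agree
identical(ord_neg, ord_rev)    # the rev() form does not
```

Negation only works for numeric keys, whereas the radix form also accepts character and factor columns.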

Does `dir` differ from `decreasing` only in that it can be specified separately for each column? — Maël, Jan 19 '22 at 12:55
@Maël I guess we can get rid of `decreasing` once `dir` works. If no columns are specified, then `dir` can default to `rep("up",ncol(x))`. Maybe `dir` is not the best name; it sounds a bit like directory. — mindlessgreen, Jan 19 '22 at 13:02
You could take a look at the `gx.sort.df` function of the [rgr package](https://cran.r-project.org/web/packages/rgr/index.html). I like its syntax, with formulas: `gx.sort.df(dat, ~ colA + colB)`. And with a `-` instead of a `+` it sorts in descending order — Stéphane Laurent, Jan 19 '22 at 14:13
@StéphaneLaurent Interesting! Didn't know about this, but it uses formulas. One could parse it and do stuff to it, but I am not convinced that formulas work well for non-interactive use. — mindlessgreen, Jan 19 '22 at 14:23
@rmf Are you "locked in" to the whole `function(x, by = character(), dir = character())` format? It is possible to mimic [`dplyr::arrange()`](https://dplyr.tidyverse.org/reference/arrange.html) in **`base`** R. So `sortdf(iris, c("Sepal.Width", "Petal.Width"), c("up", "down"))` would become `sortdf(iris, Sepal.Width, desc(Petal.Width))`, and so forth. — Greg, Jan 19 '22 at 17:35
@Greg As I mentioned, this particular case is for non-interactive use (i.e., use inside other functions), so `desc()` is likely not going to work so well. But for interactive use, `dplyr::arrange()` is probably the best option, and I completely agree with you that it would be nice to have a base R equivalent that works similarly. — mindlessgreen, Jan 20 '22 at 11:13
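
To illustrate the "one could parse it" remark about the formula syntax, here is a rough base R sketch that walks a one-sided formula and turns it into by and decreasing vectors. The helper name parse_sort_formula is hypothetical, and this is not claimed to be how rgr implements gx.sort.df. It only assumes that a term preceded by - means descending.

```r
# Hypothetical helper: turn a one-sided formula such as ~ a - b into
# `by` / `decreasing` vectors. A term preceded by `-` is treated as descending.
parse_sort_formula <- function(f) {
  stopifnot(inherits(f, "formula"), length(f) == 2)

  walk <- function(e, sign) {
    # A bare symbol is taken to be a column name
    if (is.name(e)) {
      return(list(by = as.character(e), decreasing = sign < 0))
    }
    if (!is.call(e)) stop("Unsupported term in formula.")

    op <- if (is.name(e[[1]])) as.character(e[[1]]) else ""
    if (!op %in% c("+", "-", "(")) stop("Unsupported operator in formula.")

    if (length(e) == 2) {
      # Unary +, unary - or parentheses
      return(walk(e[[2]], if (op == "-") -sign else sign))
    }

    # Binary + or -: the operator applies to the right-hand term only
    lhs <- walk(e[[2]], sign)
    rhs <- walk(e[[3]], if (op == "-") -sign else sign)
    list(by = c(lhs$by, rhs$by),
         decreasing = c(lhs$decreasing, rhs$decreasing))
  }

  walk(f[[2]], 1)
}

spec <- parse_sort_formula(~ Sepal.Width - Petal.Width)
spec$by          # column names, in formula order
spec$decreasing  # one logical per column
```

The resulting vectors are plain character and logical values, so they could be handed to any sorting function that takes by and decreasing arguments, which keeps the formula layer optional for non-interactive callers.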

score 2 · Answer 1 · edited Jan 20 '22 at 12:21

Here's my attempt, using a function taken from this answer, and assuming up is ascending, and down is descending. dir is set to "up" by default.

sortdf <- function(x, by=NULL, dir=NULL) {
  if(!is.data.frame(x)) stop("Input is not a data.frame.")
  
  if(is.null(by) && is.null(dir)) {
    dir <- rep("up", ncol(x))
  } else if (is.null(dir)) {
    dir <- rep("up", length(by))
  }
  
  sort_asc = by[which(dir == "up")]
  sort_desc = by[which(dir == "down")]

  if(is.null(by)) {
    ord <- do.call(order,x)
  } else {
    if(any(!by %in% colnames(x))) stop("One or more items in 'by' was not found.")
    if(length(by) == 1) ord <- order(x[ , by])
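    # Ascending keys are passed first, followed by negated xtfrm() proxies for the
    # descending keys, so ascending columns take priority regardless of their position in `by`
    if(length(by) > 1) ord <- do.call(order, c(as.list(x[sort_asc]), lapply(x[sort_desc], 
                                                                            function(col) -xtfrm(col))))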
  }
  
  if(length(dir) == 1 && all(dir == "down")) ord <- rev(ord)

  x[ord, , drop=FALSE]
}

You can then have multiple different directions to sort:

sortdf(iris, by=c("Sepal.Width","Petal.Width"), dir=c("up","down")) |>
  head()

    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
61           5.0         2.0          3.5         1.0 versicolor
69           6.2         2.2          4.5         1.5 versicolor
120          6.0         2.2          5.0         1.5  virginica
63           6.0         2.2          4.0         1.0 versicolor
54           5.5         2.3          4.0         1.3 versicolor
88           6.3         2.3          4.4         1.3 versicolor

And other examples work as intended as well:

sortdf(iris)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
14          4.3         3.0          1.1         0.1  setosa
9           4.4         2.9          1.4         0.2  setosa
39          4.4         3.0          1.3         0.2  setosa
43          4.4         3.2          1.3         0.2  setosa
42          4.5         2.3          1.3         0.3  setosa
4           4.6         3.1          1.5         0.2  setosa

sortdf(iris, c("Petal.Length","Sepal.Length"))
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
23          4.6         3.6          1.0         0.2  setosa
14          4.3         3.0          1.1         0.1  setosa
36          5.0         3.2          1.2         0.2  setosa
15          5.8         4.0          1.2         0.2  setosa
39          4.4         3.0          1.3         0.2  setosa
43          4.4         3.2          1.3         0.2  setosa

sortdf(iris, "Petal.Length", "down")
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
119          7.7         2.6          6.9         2.3 virginica
123          7.7         2.8          6.7         2.0 virginica
118          7.7         3.8          6.7         2.2 virginica
106          7.6         3.0          6.6         2.1 virginica
132          7.9         3.8          6.4         2.0 virginica
108          7.3         2.9          6.3         1.8 virginica

Thanks for this! Interesting that it doesn't preserve order of row numbers. `all.equal(rownames(sortdf(iris,"Petal.Length")),rownames(dplyr::arrange(iris,Petal.Length)))` — mindlessgreen, Jan 20 '22 at 11:26
By "it", I meant `dplyr::arrange()` and I never even noticed it before. — mindlessgreen, Jan 20 '22 at 11:42
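
For reference, the row-name behaviour of base indexing mentioned in these comments can be seen on a tiny example. This is only a sketch with invented data (df is not from the question). It shows that subsetting with order() carries the original row names along, and that they can be reset if sequential names are wanted.

```r
# Invented three-row data frame
df <- data.frame(a = c(3, 1, 2))

# Base indexing keeps the original row names attached to each row
sorted <- df[order(df$a), , drop = FALSE]
kept <- rownames(sorted)

# Assigning NULL resets the row names to 1..n
rownames(sorted) <- NULL
reset <- rownames(sorted)
```

Whether to reset is a design choice: keeping the names makes it possible to trace rows back to their original positions.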

Mikko Marttila · Answer 2 · 2022-01-20T14:46:27.747

Here’s another alternative that gets rid of all the branching logic by ensuring you always find a proxy to sort by for each by column with xtfrm(). For consistency with base, instead of using a “new” dir argument, it might also be preferable to keep the decreasing argument, but just allow it to be a vector that’s recycled to match the by length.

sortdf <- function(x, by = colnames(x), decreasing = FALSE, ...) {
  if (!is.data.frame(x)) {
    stop("Input is not a data.frame.")
  }

  if (!all(by %in% colnames(x))) {
    stop("One or more items in 'by' was not found.")
  }
  
  # Recycle `decreasing` to ensure it matches `by`
  decreasing <- rep_len(as.logical(decreasing), length(by))
  
  # Find a sorting proxy for each `by` column, according to `decreasing`
  pxy <- Map(function(x, decr) (-1)^decr * xtfrm(x), x[by], decreasing)
  ord <- do.call(order, c(pxy, list(...)))
  
  x[ord, , drop = FALSE]
}

Thinking about this a bit more, I might even simplify this further and:

Let Map() handle the recycling for by and decreasing.
Let [ handle throwing errors for incorrect indexing (and consequently also accept numeric indices for columns rather than just names).
Not pass ... (following the YAGNI principle).

This could then come down to two one-liner functions:

sortdf <- function(x, by = colnames(x), decreasing = FALSE) {
  x[do.call(order, Map(sortproxy, x[by], decreasing)), , drop = FALSE]
}

sortproxy <- function(x, decreasing = FALSE) {
  as.integer((-1)^as.logical(decreasing)) * xtfrm(x)
}
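
The two behaviours this simplification relies on can be checked in isolation. The following is a small sketch with invented data (df is an assumption): Map() recycles a length-one decreasing across all selected columns, and [ accepts numeric column indices while raising an error for a column name that does not exist.

```r
# Invented data frame with one numeric and one character column
df <- data.frame(a = c(2, 1, 3), b = c("x", "z", "y"))

# Map() recycles the shorter argument: one FALSE is paired with every column
recycled <- Map(function(col, decr) decr, df[c("a", "b")], FALSE)

# `[` accepts numeric column indices as well as names
by_index <- names(df[2:1])

# `[` signals an error for an unknown column, so no explicit check is needed
bad <- tryCatch(df["nope"], error = function(e) e)
inherits(bad, "error")
```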

Examples:

sortdf(iris) |> head()
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 14          4.3         3.0          1.1         0.1  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 39          4.4         3.0          1.3         0.2  setosa
#> 43          4.4         3.2          1.3         0.2  setosa
#> 42          4.5         2.3          1.3         0.3  setosa
#> 4           4.6         3.1          1.5         0.2  setosa

sortdf(iris, by = c("Sepal.Length", "Sepal.Width")) |> head()
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 14          4.3         3.0          1.1         0.1  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 39          4.4         3.0          1.3         0.2  setosa
#> 43          4.4         3.2          1.3         0.2  setosa
#> 42          4.5         2.3          1.3         0.3  setosa
#> 4           4.6         3.1          1.5         0.2  setosa

sortdf(iris, by = c("Sepal.Length", "Sepal.Width"), decreasing = TRUE) |> head()
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 132          7.9         3.8          6.4         2.0 virginica
#> 118          7.7         3.8          6.7         2.2 virginica
#> 136          7.7         3.0          6.1         2.3 virginica
#> 123          7.7         2.8          6.7         2.0 virginica
#> 119          7.7         2.6          6.9         2.3 virginica
#> 106          7.6         3.0          6.6         2.1 virginica

sortdf(iris, by = c("Sepal.Length", "Sepal.Width"), decreasing = c(TRUE, FALSE)) |> head()
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 132          7.9         3.8          6.4         2.0 virginica
#> 119          7.7         2.6          6.9         2.3 virginica
#> 123          7.7         2.8          6.7         2.0 virginica
#> 136          7.7         3.0          6.1         2.3 virginica
#> 118          7.7         3.8          6.7         2.2 virginica
#> 106          7.6         3.0          6.6         2.1 virginica

Nice! I like the simplicity and the use of decreasing. I am surprised that this doesn't translate to faster execution time. — mindlessgreen, Jan 20 '22 at 12:26
Yeah, to be honest I'm surprised to see there are any differences in performance at all. My first thought was that it might be because of the `xtfrm()` calls even on the non-decreasing cols, but now I'm not sure anymore. Regardless, I suspect the differences will be smoothed out quickly with larger data. — Mikko Marttila, Jan 20 '22 at 12:44
@rmf Oh I realized that my method is specifically slower for the case of sorting with only 1 column. In @Maël's answer that's special cased to avoid the `do.call()`, which is where the difference comes from. — Mikko Marttila, Jan 21 '22 at 11:30
Would you mind elaborating what exactly is `* xtfrm(x)` supposed to do in `(-1)^decr * xtfrm(x)`? — tmfmnk, Jan 21 '22 at 13:47
@tmfmnk Sure: `xtfrm(x)` gives a numeric vector that will sort in the same way as the original `x` would have. That lets you do a decreasing sort by taking the negative value. Of course for numeric inputs it doesn't matter, but it's required for factor/character inputs and lets you treat all inputs uniformly. — Mikko Marttila, Jan 21 '22 at 14:10
And the `(-1)^decr` is basically just a way to write `if (decr) -1 else 1` algebraically. — Mikko Marttila, Jan 21 '22 at 14:11
Perhaps a better explanation would be that we know how to sort numeric values in decreasing order: use the order you would get from sorting the negated values in ascending order. `xtfrm()` allows us to generalize to any input, including factor/character, by giving a way to turn them into numeric values for sorting purposes. `rank()` would do a similar thing, as is mentioned in the `?xtfrm` documentation, but `xtfrm()` is generic, so it allows custom methods for custom classes. — Mikko Marttila, Jan 21 '22 at 14:32
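
The explanation in these comments can be tried out directly. Here is a minimal sketch with made-up vectors (f and s are assumptions, not from the answer): xtfrm() yields numeric proxies for a factor and a character vector, negating the proxies gives a descending order, and (-1)^decr evaluates to -1 or 1.

```r
# Made-up inputs: a factor with explicit level order and a character vector
f <- factor(c("low", "high", "mid", "high"), levels = c("low", "mid", "high"))
s <- c("b", "a", "c")

# Numeric proxies that sort the same way as the originals
proxy_f <- xtfrm(f)   # integer level codes for a factor
proxy_s <- xtfrm(s)   # ranks for a character vector

# Negating the proxy and sorting ascending gives a descending sort
desc_f <- order(-proxy_f)
desc_s <- order(-proxy_s)

# Negating the character vector itself is an error, hence the proxy
neg_fails <- inherits(tryCatch(-s, error = function(e) e), "error")

# TRUE is coerced to 1 and FALSE to 0, so this yields the sign to multiply by
signs <- (-1)^c(TRUE, FALSE)
```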
