2

TLDR: How do I sort objects without having to introduce a new S3 class globally?

In R we need to introduce S3 classes to sort custom objects (see, i.e., this answer). Here is an example where I sort a list of strings based on their length.

`[.customSort` <- function(x, i, ...) structure(unclass(x)[i], class = "customSort")
`==.customSort` <- function(a, b) nchar(a[[1]]) == nchar(b[[1]])
`>.customSort` <- function(a, b) nchar(a[[1]]) > nchar(b[[1]])
customObject <- structure(list('abc', 'de', 'fghi'), class = 'customSort')
unlist(sort(customObject))
# [1] "de"   "abc"  "fghi"

In my R package, I want to offer a sort function mySort(..., compare). But, instead of going through the create-an-S3-class ordeal, the user should be able to supply a compare function (this is similar to implementations in Python, cpp, Java, Go, etc.)

# Try 1
mySort <- function(someList, compare) {
  `[.tmpclass` <- function(x, i, ...) structure(unclass(x)[i], class = 'tmpclass')
  `==.tmpclass` <- function(a, b) compare(a[[1]],b[[1]]) == 0
  `>.tmpclass` <- function(a, b) compare(a[[1]],b[[1]]) > 0
  
  class(someList) <- 'tmpclass'
  sort(someList)
}

# Try 2
mySort <- function(someList, compare) {
  local({
    class(someList) <- 'tmpclass'
    sort(someList)

  }, envir = list(
    `[.tmpclass` = function(x, i, ...) structure(unclass(x)[i], class = 'tmpclass'),
    `==.tmpclass` = function(a, b) compare(a[[1]],b[[1]]) == 0,
    `>.tmpclass` = function(a, b) compare(a[[1]],b[[1]]) > 0
  ))
}

l <- list('hello', 'world', 'how', 'is', 'everything')

# sort by char length
mySort(l, compare = function(a,b) nchar(a) - nchar(b))

While on the top level comparisons work as expected, all "memory" of that temporary S3 class is lost once sort is called. So, things like someList[1] > someList[2] produce the expected result when debugging right before the sort call, but once I step into the sort call, all that information is lost.

Curiously enough, I do get one step further if I explicitly set the environment of the sort function.

environment(sort) <- environment()
sort(someList)

Through this, if I debug and step into sort, I am still able to make comparisons. Once sort calls more underlying methods however, this information is lost again.

The same goes if I try to call order (which is also called by sort at some point). If I set the environment for order before calling it, comparisons work fine when debugging and stepping into that function. But, once order calls xtfrm(x), this information is seemingly lost again.

mySort <- function(someList, compare) {
  `[.tmpclass` <- function(x, i, ...) structure(unclass(x)[i], class = 'tmpclass')
  `==.tmpclass` <- function(a, b) compare(a[[1]],b[[1]]) == 0
  `>.tmpclass` <- function(a, b) compare(a[[1]],b[[1]]) > 0
  
  class(someList) <- 'tmpclass'
  environment(order) <- environment()
  order(someList)
}

l <- list('hello', 'world', 'how', 'is', 'everything')
mySort(l, compare = function(a,b) nchar(a) - nchar(b))

Since xtfrm is a primitive function that I can't seem to debug, I have a hunch that that may be actually causing problems. But I'm not sure.

Finally, it does actually work if I use some tacky global-environment version.

mySort <- function(someList, compare) {
  # initialize globally
  `[.tmpclass` <<- function(x, i, ...) structure(unclass(x)[i], class = 'tmpclass')
  `==.tmpclass` <<- function(a, b) compare(a[[1]],b[[1]]) == 0
  `>.tmpclass` <<- function(a, b) compare(a[[1]],b[[1]]) > 0
  
  oldClass <- class(someList)
  class(someList) <- 'tmpclass'
  result <- sort(someList)
  
  # make sure not to leave garbage behind
  remove('[.tmpclass', '==.tmpclass', '>.tmpclass', envir = .GlobalEnv)
  structure(result, class = oldClass)
}

l <- list('hello', 'world', 'how', 'is', 'everything')
unlist(mySort(l, compare = function(a,b) nchar(a) - nchar(b)))
# [1] "is" "how" "hello" "world" "everything"

However, this does not feel like a solid answer, let alone something easily accepted by CRAN (unless there is some way to create unique names that don't accidentally overwrite global variables?)

Is there a way to sort objects using a simple comparison function without introducing an S3 class globally? Or should I write my own sort algorithm now?

Felix Jassler
  • 1,029
  • 11
  • 22

1 Answers1

3

You don't need to use S3 classes to create a custom sorting function in R. You can instead apply the comparator function to each pair of entries, then sort the tabulated comparisons.

As in the other languages you mention, comparator functions are normally binary logical functions which return TRUE if a is "greater" than b, and FALSE otherwise, so I will stick to that convention here.

We simply tabulate the number of times the comparison is "greater" for each element, and sort the resultant table. The names of the sorted table give us the indices of the original list, but sorted by our comparison function.

The following function handles all of this:

mySort <- function(someList, compare) {
  indices <- seq_along(someList)
  comps <- expand.grid(x = indices, y = indices)
  comps$diff <- apply(comps, 1, function(x)
    !compare(someList[[x[1]]], someList[[x[2]]])
  )
  answer <- table(comps$x, comps$diff)[,1] |> sort() |> names() |> as.numeric()
  result <- someList[answer]
  attributes(result) <- attributes(someList)
  names(result) <- names(someList)[answer]
  return(result)
}

This can handle arbitrary lists and vectors, returning the data in its original format. Testing on a list of strings, we have:

l <- list('hello', 'world', 'how', 'is', 'everything')

mySort(l, compare = function(a, b) nchar(a) > nchar(b))
#> [[1]]
#> [1] "is"
#> 
#> [[2]]
#> [1] "how"
#> 
#> [[3]]
#> [1] "hello"
#> 
#> [[4]]
#> [1] "world"
#> 
#> [[5]]
#> [1] "everything"

Or, testing on a vector of integers, we can sort them according to their value modulo 3:

l2 <- c(1, 2, 3, 4, 5, 6)

mySort(l2, compare = function(a, b) a %% 3 > b %% 3)
#> [1] 3 6 1 4 2 5

We can even sort lists of vectors (like data frames) by specified criteria. Here, for example, we sort the columns of mtcars according to the first value in each column:

head(mySort(mtcars, compare = function(a, b) a[1] > b[1]))
#>                   vs am    wt drat gear carb cyl  qsec  mpg  hp disp
#> Mazda RX4          0  1 2.620 3.90    4    4   6 16.46 21.0 110  160
#> Mazda RX4 Wag      0  1 2.875 3.90    4    4   6 17.02 21.0 110  160
#> Datsun 710         1  1 2.320 3.85    4    1   4 18.61 22.8  93  108
#> Hornet 4 Drive     1  0 3.215 3.08    3    1   6 19.44 21.4 110  258
#> Hornet Sportabout  0  0 3.440 3.15    3    2   8 17.02 18.7 175  360
#> Valiant            1  0 3.460 2.76    3    1   6 20.22 18.1 105  225

And to demonstrate it working on an arbitrary list of objects, let's sort a list of lm objects according to their memory size:

models <- list(iris = lm(Sepal.Length ~ Petal.Width + Species, data = iris),
               mtcars = lm(mpg ~ wt, data = mtcars))

mySort(models, function(a, b) object.size(a) > object.size(b))
#> $mtcars
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = mtcars)
#> 
#> Coefficients:
#> (Intercept)           wt  
#>      37.285       -5.344  
#> 
#> 
#> $iris
#> 
#> Call:
#> lm(formula = Sepal.Length ~ Petal.Width + Species, data = iris)
#> 
#> Coefficients:
#>       (Intercept)        Petal.Width  Speciesversicolor   Speciesvirginica  
#>           4.78044            0.91690           -0.06025           -0.05009

Created on 2023-01-08 with reprex v2.0.2

Felix Jassler
  • 1,029
  • 11
  • 22
Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
  • Is it possible to avoid `unlist`? The user should be able to supply any data, not just a list of strings or single numbers (see i.e. [this question](https://stackoverflow.com/q/70137844/7669319)) – Felix Jassler Jan 08 '23 at 18:33
  • @FelixJassler sorry, that wasn't clear from the question. Your example showed an input that was a list of strings but returned a _vector_ of strings, so I assumed that was what you wanted. Can you define a bit better what the possible input and output should be? Should the output simply be the same format as the input, unlike the example in your question? – Allan Cameron Jan 08 '23 at 19:33
  • My bad, I thought I specified somewhere that the user should be able to supply a list with items of any type. Here I used simple strings, in my first comment I [linked to a case](https://stackoverflow.com/q/70137844/7669319) where I want to sort vectors (which fails with this approach). I am really not looking for solutions to these specific cases, but a general way to just sort lists of any type, given that the user specifies the relation between items. – Felix Jassler Jan 08 '23 at 20:03
  • @FelixJassler see my update, which uses indices and copies attributes. It should work on lists of arbitrary objects and even on vectors – Allan Cameron Jan 08 '23 at 20:08
  • I've spotted an edge case where every comparison is `FALSE` (i.e., every item is considered equal). I found a quick fix by simply inverting the compare-evaluation and reading the first column instead of the second, but I can't edit your comment because the "edit queue is full" – Felix Jassler Jan 10 '23 at 13:30
  • 1
    I've done the inversions now @FelixJassler – Allan Cameron Jan 10 '23 at 13:33