
I've finally decided to put the sort.data.frame method that's floating around the internet into an R package. It just gets requested too much to be left to an ad hoc method of distribution.

However, it's written with arguments that make it incompatible with the generic sort function:

sort(x,decreasing,...)
sort.data.frame(form,dat)

If I change sort.data.frame to take decreasing as an argument, as in sort.data.frame(form,decreasing,dat), and then discard decreasing, it loses its simplicity because you'll always have to specify dat= and can't really use positional arguments. If I add it to the end, as in sort.data.frame(form,dat,decreasing), then the order doesn't match the generic function. If I hope that decreasing gets caught up in the dots, as in sort.data.frame(form,dat,...), then when using position-based matching I believe the generic function will assign the second position to decreasing and it will get discarded. What's the best way to harmonize these two functions?

The full function is:

# Sort a data frame
sort.data.frame <- function(form,dat){
# Author: Kevin Wright
# http://tolstoy.newcastle.edu.au/R/help/04/09/4300.html
# Some ideas from Andy Liaw
# http://tolstoy.newcastle.edu.au/R/help/04/07/1076.html
# Use + for ascending, - for descending.
# Sorting is left to right in the formula
# Usage is either of the following:
# sort.data.frame(~Block-Variety,Oats)
# sort.data.frame(Oats,~-Variety+Block)

# If dat is the formula, then switch form and dat
  if(inherits(dat,"formula")){
    f=dat
    dat=form
    form=f
  }
  if(form[[1]] != "~") {
    stop("Formula must be one-sided.")
  }
# Make the formula into character and remove spaces
  formc <- as.character(form[2])
  formc <- gsub(" ","",formc)
# If the first character is not + or -, add +
  if(!is.element(substring(formc,1,1),c("+","-"))) {
    formc <- paste("+",formc,sep="")
  }
# Extract the variables from the formula
  vars <- unlist(strsplit(formc, "[\\+\\-]"))
  vars <- vars[vars!=""] # Remove spurious "" terms
# Build a list of arguments to pass to "order" function
  calllist <- list()
  pos=1 # Position of + or -
  for(i in 1:length(vars)){
    varsign <- substring(formc,pos,pos)
    pos <- pos+1+nchar(vars[i])
    if(is.factor(dat[,vars[i]])){
      if(varsign=="-")
        calllist[[i]] <- -rank(dat[,vars[i]])
      else
        calllist[[i]] <- rank(dat[,vars[i]])
    }
    else {
      if(varsign=="-")
        calllist[[i]] <- -dat[,vars[i]]
      else
        calllist[[i]] <- dat[,vars[i]]
    }
  }
  dat[do.call("order",calllist),]
} 

Example:

library(datasets)
sort.data.frame(~len+dose,ToothGrowth)
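
For comparison, the example call should give the same result as a direct order() call in base R. This is a minimal sketch; the names sorted and sorted_desc are illustrative only.

```r
library(datasets)

# Ascending by len, ties broken by ascending dose (same as ~len+dose)
sorted <- ToothGrowth[order(ToothGrowth$len, ToothGrowth$dose), ]

# Descending len: negate the numeric column, as the function does for "-" terms
sorted_desc <- ToothGrowth[order(-ToothGrowth$len, ToothGrowth$dose), ]

head(sorted)
```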
Ari B. Friedman

3 Answers


Use the arrange function in plyr. It allows you to individually pick which variables should be in ascending and descending order:

arrange(ToothGrowth, len, dose)
arrange(ToothGrowth, desc(len), dose)
arrange(ToothGrowth, len, desc(dose))
arrange(ToothGrowth, desc(len), desc(dose))

It also has an elegant implementation:

arrange <- function (df, ...) {
  ord <- eval(substitute(order(...)), df, parent.frame())
  unrowname(df[ord, ])
}

And desc is just an ordinary function:

desc <- function (x) -xtfrm(x)

Reading the help for xtfrm is highly recommended if you're writing this sort of function.
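
To illustrate why xtfrm matters here, the following base-R sketch reuses the two definitions quoted above without plyr. The name arrange_sketch and the toy data frame d are assumptions for illustration, and the row-name reset done by unrowname is left out.

```r
# desc as quoted above: map any sortable vector to numbers, then negate
desc <- function(x) -xtfrm(x)

# Same idea as arrange, minus the row-name reset
arrange_sketch <- function(df, ...) {
  # Evaluate order(<user expressions>) with the columns of df in scope
  ord <- eval(substitute(order(...)), df, parent.frame())
  df[ord, , drop = FALSE]
}

# Illustrative data only
d <- data.frame(
  name  = c("b", "a", "c", "a"),
  score = c(2, 3, 1, 1),
  stringsAsFactors = FALSE
)

# -d$name is an error for a character column; desc(name) works
by_name_desc <- arrange_sketch(d, desc(name), score)
by_name_desc
```

Because xtfrm returns a numeric sort key for characters and factors as well as numbers, the same negation trick works for every column type that order can handle.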

hadley
  • Thanks. This seems poised to become my replacement. But I'm still curious how one would go about making a generic and its methods consistent, since it comes up fairly often for me. Also, syntactically, a sort() method would seem to keep things consistent with other data types. But that's some pretty code :-) – Ari B. Friedman Jul 27 '11 at 07:05
  • `?arrange` indicates that: "# NOTE: plyr functions do NOT preserve row.names". This makes this excellent function suboptimal if one wants to keep `row.names`. Why not add a `keep.row.names=FALSE` option? – landroni Mar 10 '14 at 16:29
  • @landroni because I don't think that they're a good idea - it's better to add them as an explicit variable. – hadley Mar 11 '14 at 03:08
  • I see. But still, this is standard functionality associated with `data.frame`, at least as far as most users are concerned, and it would be useful to give those users the choice. – landroni Mar 11 '14 at 10:21

There are a few problems there. sort.data.frame needs to have the same arguments as the generic, so at a minimum it needs to be

sort.data.frame <- function(x, decreasing = FALSE, ...) {
  # ....
}

To have dispatch work, the first argument needs to be the object dispatched on. So I would start with:

sort.data.frame <- function(x, decreasing = FALSE, formula = ~ ., ...) {
  # ....
}

where x is your dat, formula is your form, and we provide a default for formula to include everything. (I haven't studied your code in detail to see exactly what form represents.)

Of course, you don't need to specify decreasing in the call, so:

sort(ToothGrowth, formula = ~ len + dose)

would be how to call the function using the above specifications.

Otherwise, if you don't want sort.data.frame to be an S3 method, call it something else and then you are free to have whatever arguments you want.
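
As a rough illustration of the S3 route, here is one way the method body could be filled in. It is a sketch under stated assumptions, not the original author's code: the formula is assumed to contain only plain column names joined by + or -, and decreasing = TRUE is assumed to flip every term.

```r
library(datasets)

# Sketch only: an S3 method using the signature proposed above.
# The parsing rules (plain column names joined by + or -) are an assumption.
sort.data.frame <- function(x, decreasing = FALSE, formula = ~ ., ...) {
  stopifnot(inherits(formula, "formula"), length(formula) == 2L)

  # Right-hand side as text without spaces, e.g. "-len+dose"
  rhs <- gsub(" ", "", paste(deparse(formula[[2]]), collapse = ""), fixed = TRUE)
  if (rhs == ".") rhs <- paste(names(x), collapse = "+")  # default: every column
  if (!substring(rhs, 1, 1) %in% c("+", "-")) rhs <- paste0("+", rhs)

  signs <- regmatches(rhs, gregexpr("[+-]", rhs))[[1]]  # one sign per term
  vars  <- strsplit(substring(rhs, 2), "[+-]")[[1]]     # one name per term

  keys <- Map(function(v, s) {
    key <- xtfrm(x[[v]])  # numeric sort key, also for factors and strings
    # A "-" term reverses that column; decreasing = TRUE flips all of them
    if (xor(s == "-", decreasing)) -key else key
  }, vars, signs)

  x[do.call(order, unname(keys)), , drop = FALSE]
}

by_len  <- sort(ToothGrowth, formula = ~ len + dose)
by_dose <- sort(ToothGrowth, formula = ~ -dose + len)

# A formula in the second position is matched to `decreasing` by the generic
positional <- tryCatch(sort(ToothGrowth, ~ len + dose), error = function(e) e)
```

In the R versions I am aware of, the sort generic checks that decreasing is a single logical value before it dispatches, so the positional call stops with an error instead of reaching the method. That is why the formula has to be passed by name (or by a partial name such as f =) with this design.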

Gavin Simpson
  • With partial matching, it isn't so bad to write `sort(ToothGrowth, f = ~ len + dose)` so that's why I did and kept the S3ness of it. Thanks for the suggestion. – Ari B. Friedman Jul 26 '11 at 22:53
  • Shouldn't we define a `sort.data.frame.formula` that would take a formula as first argument, and if it fails the formula test in `UseMethod` would then dispatch to sort.data.frame that takes a first data argument? (Same as the situation with `aggregate.*`) – IRTFM Jan 12 '13 at 21:30
  • @DWin You mean `sort.formula`, yes? – Gavin Simpson Jan 12 '13 at 22:21
  • I was thinking I wanted it to drop back to a `sort.data.frame.default` method or `sort.dataframe` that would accept a first argument as a dataframe. – IRTFM Jan 12 '13 at 22:24

I agree with @Gavin that x must come first. I'd put the decreasing parameter after the formula though - since it probably isn't used that much, and hardly ever as a positional argument.

The formula argument would be used much more and therefore should be the second argument. I also strongly agree with @Gavin that it should be called formula, and not form.

sort.data.frame <- function(x, formula = ~ ., decreasing = FALSE, ...) {
  # ...
}

You might want to extend the decreasing argument to allow a logical vector where each TRUE/FALSE value corresponds to one column in the formula:

d <- data.frame(A=1:10, B=10:1)
sort(d, ~ A+B, decreasing=c(A=TRUE, B=FALSE)) # sort by decreasing A, increasing B
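
The per-column idea can be sketched with a standalone helper. The function name sort_cols and the small data frame below are hypothetical and chosen so that ties in A make the effect of B visible.

```r
# Hypothetical helper: `decreasing` is a logical vector named by column
sort_cols <- function(x, formula, decreasing = FALSE) {
  vars <- all.vars(formula)

  # A single unnamed value applies to every column in the formula
  if (length(decreasing) == 1L && is.null(names(decreasing))) {
    decreasing <- setNames(rep(decreasing, length(vars)), vars)
  }
  stopifnot(all(vars %in% names(decreasing)))

  keys <- lapply(vars, function(v) {
    key <- xtfrm(x[[v]])              # numeric sort key for any column type
    if (decreasing[[v]]) -key else key
  })
  x[do.call(order, keys), , drop = FALSE]
}

# Illustrative data with ties in A
d2 <- data.frame(A = c(1, 2, 2, 3), B = c(4, 1, 3, 2))
res <- sort_cols(d2, ~ A + B, decreasing = c(A = TRUE, B = FALSE))
res
```

A standalone function is used here because, as far as I can tell, the sort generic rejects any decreasing value that is not a single logical, so a vector like c(A=TRUE, B=FALSE) would not get through sort() to a method.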
Tommy
  • I'd *like* the formula argument to be second, but I'm not sure I can have it that way and still have it be an S3 method. I'd like to not have a `decreasing` at all, since the formula takes negative arguments which implies decreasing. – Ari B. Friedman Jul 26 '11 at 22:11
  • @gsk3, `sort.int` has `decreasing=...` only as the fourth parameter, so my guess is you can have `formula=...` as your second. I suspect you can also use `decreasing=NULL` and ignore this parameter in your code (in the same way that `sort.int` ignores `decreasing` when `partial=TRUE`). PS. All of this can be found in `?sort`. – Andrie Jul 26 '11 at 22:42
  • @Andrie, even if you flip the order, because `decreasing` is named second in the generic function, it grabs the positional argument. So it doesn't help, sadly. – Ari B. Friedman Jul 26 '11 at 23:00
  • @Andrie `sort.int` is not a method of `sort`. There is no class `int`. You can see the implemented methods with `methods(sort)`. – Marek Jul 27 '11 at 05:52