
I have a data frame `full` from which I want to take the last column and a column `v`. I then want to sort both columns on `v` in the fastest way possible. `full` is read in from a csv, but the following can be used for testing (I've included some NAs for realism):

n <- 200000
full <- data.frame(A = runif(n, 1, 10000), B = floor(runif(n, 0, 1.9)))
full[sample(n, 10000), 'A'] <- NA
v <- 1

I have v as one here, but in reality it could change, and full has many columns.


I have tried sorting data frames, data tables and matrices, each with `order` and `sort.list` (some ideas taken from this thread). The code for all of these:

# DATA FRAME

ord_df <- function() {
  a <- full[c(v, length(full))]
  a[with(a, order(a[1])), ]
}

sl_df <- function() {
  a <- full[c(v, length(full))]
  a[sort.list(a[[1]]), ] 
}


# DATA TABLE

require(data.table)

ord_dt <- function() {
  a <- as.data.table(full[c(v, length(full))])
  colnames(a)[1] <- 'values'
  a[order(values)]
}

sl_dt <- function() {
  a <- as.data.table(full[c(v, length(full))])
  colnames(a)[1] <- 'values'
  a[sort.list(values)]
}


# MATRIX

ord_mat <- function() {
  a <- as.matrix(full[c(v, length(full))])
  a[order(a[, 1]), ] 
}

sl_mat <- function() {
  a <- as.matrix(full[c(v, length(full))])
  a[sort.list(a[, 1]), ] 
}

Timing results (elapsed seconds from `system.time`, summarised over 10 runs):

         ord_df  sl_df    ord_dt   sl_dt   ord_mat sl_mat
Min.     0.230   0.1500   0.1300   0.120   0.140   0.1400
Median   0.250   0.1600   0.1400   0.140   0.140   0.1400
Mean     0.244   0.1610   0.1430   0.136   0.142   0.1450
Max.     0.250   0.1700   0.1600   0.140   0.160   0.1600

Or using microbenchmark (results are in milliseconds):

             min      lq       median   uq       max
1  ord_df() 243.0647 248.2768 254.0544 265.2589 352.3984
2  ord_dt() 133.8159 140.0111 143.8202 148.4957 181.2647
3 ord_mat() 140.5198 146.8131 149.9876 154.6649 191.6897
4   sl_df() 152.6985 161.5591 166.5147 171.2891 194.7155
5   sl_dt() 132.1414 139.7655 144.1281 149.6844 188.8592
6  sl_mat() 139.2420 146.8578 151.6760 156.6174 186.5416

It seems like ordering the data table wins. There isn't all that much difference between `order` and `sort.list`, except when using data frames, where `sort.list` is much faster.

In the data table versions I also tried setting `v` as the key (since the table is then sorted, according to the documentation), but I couldn't get it to work since the contents of `v` are not integer.
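For reference, here is a sketch of what the key-based version might look like with data.table 1.8.2 or later (which, as the comments below point out, allows keys on non-integer columns); the name key_dt matches the updated functions in the answer below:

# sketch only: assumes data.table >= 1.8.2, which allows a key on a non-integer column
key_dt <- function() {
  a <- as.data.table(full[c(v, length(full))])
  setnames(a, colnames(a)[1], 'values')
  setkey(a, values)  # sorts the table by 'values', by reference
  a
}

Once the key is set, the table is stored in sorted order, so no separate ordering call is needed.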

I would ideally like to speed this up as much as possible, since I have to do it many times for different `v` values. Does anyone know how I might be able to speed this process up even further? Also, might it be worth trying an Rcpp implementation? Thanks.


Here's the code I used for timing if it's useful to anyone:

sortMethods <- list(ord_df, sl_df, ord_dt, sl_dt, ord_mat, sl_mat)

require(plyr)
timings <- raply(10, sapply(sortMethods, function(x) system.time(x())[[3]]))
colnames(timings) <- c('ord_df', 'sl_df', 'ord_dt', 'sl_dt', 'ord_mat', 'sl_mat')
apply(timings, 2, summary) 

require(microbenchmark)
mb <- microbenchmark(ord_df(), sl_df(), ord_dt(), sl_dt(), ord_mat(), sl_mat())
plot(mb)
Fist
  • I don't know your exact specs, but if they allow it, consider running things like `full <- as.matrix(full)` or `full <- as.data.table(full)` once for all outside of your sorting functions. These transformations may have a significant impact on your computation times while they are not really doing the sorting. – flodel Jul 19 '12 at 11:04
  • Well `full` is originally a data frame before any sorting takes place (I can't change that), so converting it to a matrix or data table is part of the sorting method and so needs to be included in the timings, I think. – Fist Jul 19 '12 at 11:10
  • Please upgrade to data.table v1.8.2, now on CRAN. One comment was that you couldn't set a key on a non-integer column - that is now ok in 1.8.2, see NEWS. Setting a key on v should be faster, yes. Also, you seem to be including the time to convert to data.table in the time to sort. Is that fair? However, you might find that in 1.8.2 the time to convert is much faster than in 1.8.0. +1 btw. Nice question. – Matt Dowle Jul 19 '12 at 11:29
  • http://stackoverflow.com/questions/1296646/how-to-sort-a-dataframe-by-columns-in-r/6871968#6871968 for data.frame fastest methods. – Ari B. Friedman Jul 19 '12 at 11:40
  • You may also need more than 200,000 rows. I generally roll my eyes whenever I see significant differences of insignificant times being reported. You're saying data.table wins on the basis of 0.13 secs vs 0.23 secs? I'd say that's insignificant. See vignette("datatable-timings") for what I consider to be good benchmarks. 8.8 secs reduced to 0.013 (678 times faster) and 22.6 secs reduced to 1.1 sec (19 times faster). Those are significant differences of significant times. Tasks where you (almost) have enough time to make a cup of tea while you wait. – Matt Dowle Jul 19 '12 at 11:55
  • In other words, conclusions drawn on small timings do _not_ hold when scaled. Your benchmark does actually need to take a significant amount of time, before you can draw conclusions. Unless in the real world you really do care about 0.1 second, which is possible of course, but you didn't say. Please confirm the scale that matters to you. – Matt Dowle Jul 19 '12 at 11:59
  • For `ord_df`, no need for `with`. Your speed differences are due to ordering a data frame `order(a[1])` rather than a vector `order(a[[1]])`; `order(a[[1]])` is like `sort.list(a[[1]])`. If the column to order is actually integer valued then `sort.list(..., method="radix")` is very fast (faster than data.table?). – Martin Morgan Jul 19 '12 at 12:00
  • @Martin `data.table` uses `sort.list(...,method="radix")`. As noted in `?setkey`. – Matt Dowle Jul 19 '12 at 12:02
  • Re: your answer to my comment. In your question, you say you have to do this sorting for many different values of `v`. I maintain that you should transform `full` to (for example) a `data.table` only once, hence no need to make `as.data.table` part of your computation times. – flodel Jul 19 '12 at 12:08
  • @MatthewDowle but for `a$B = as.integer(a$B)` `a[sort.list(a$B, method="radix"),]` is about 2x faster than `b[order(B),]`, where `b = as.data.table(a)`, for me, on this data set. – Martin Morgan Jul 19 '12 at 12:08
  • That reminds me. Try testing speed of sorting character vectors. Either `setkey` on a character column, or using `?chorder`. See `example(chmatch)` for timings. That's significantly faster than anything else available in R, afaik. – Matt Dowle Jul 19 '12 at 12:12
  • @Martin Yes. `setkey` uses radix as documented by `?setkey`. If `order` is passed as `i` then it does whatever it's passed. Btw, `a$B = as.integer(a$B)` is base syntax that copies the _whole_ of `a` just to change the type of one column. The datatable way is `a[,B:=as.integer(B)]` to change the type of `B` by reference with no copy at all, even once. – Matt Dowle Jul 19 '12 at 12:19
  • @Matthew You're right, I was working off data.table 1.7.2. I'll give 1.8.2 a go. I picked 200,000 rows as an example really. The data will likely have much more. I'll use a larger size. 0.1 of a second is probably insignificant generally but I will be running this process many times so any speed increase is significant in my case. @flodel Yes sorry you are right. I should be converting `full` only once. I'll take this out of the benchmarks. @Everyone I'll update the benchmarks and post it up when I have the new results (once I get data.table 1.8.2 installed, having trouble overwriting 1.7.2). – Fist Jul 19 '12 at 13:19
  • @Fist 1.7.2 !!?? That's 10 versions ago from Nov 2011. Some authors are sensitive to comparisons at the best of times (me more than most just because `data.table` does make claims to be fast), but reporting timings with a version that old is pushing it. If you are on Windows you need to close all your R sessions before you can overwrite the .dll – Matt Dowle Jul 19 '12 at 13:46
  • Sorry, I had such an old version because I'm living in the past and using R 2.13.2 (and I was setup to download packages from .../cran.r-project.org/bin/windows/contrib/2.13/). I can't upgrade R either. – Fist Jul 19 '12 at 14:15
  • Updated tests in the answer below. – Fist Jul 19 '12 at 22:44

1 Answer


I don't know if it's better to put this sort of thing in as an edit, but it seems more like an answer, so here will do. Updated test functions:

n <- 1e7
full <- data.frame(A = runif(n, 1, 10000), B = floor(runif(n, 0, 1.9)))
full[sample(n, 100000), 'A'] <- NA

fdf <- full
fma <- as.matrix(full)

require(data.table)  # needed before as.data.table() and setnames() below
fdt <- as.data.table(full)
setnames(fdt, colnames(fdt)[1], 'values')

# DATA FRAME
ord_df <- function() { fdf[order(fdf[1]), ] }
sl_df <- function() { fdf[sort.list(fdf[[1]]), ] }

# DATA TABLE
ord_dt <- function() { fdt[order(values)] }

key_dt <- function() {
  setkey(fdt, values) 
  fdt
}

# MATRIX
ord_mat <- function() { fma[order(fma[, 1]), ] }
sl_mat <- function() { fma[sort.list(fma[, 1]), ] }

Results (in seconds; using a different computer, R 2.13.1 and data.table 1.8.2):

         ord_df  sl_df   ord_dt  key_dt  ord_mat sl_mat
Min.     37.56   20.86   2.946   2.249   20.22   20.21
1st Qu.  37.73   21.15   2.962   2.255   20.54   20.59
Median   38.43   21.74   3.002   2.280   21.05   20.82
Mean     38.76   21.75   3.074   2.395   21.09   20.95
3rd Qu.  39.85   22.18   3.151   2.445   21.48   21.42
Max.     40.36   23.08   3.330   2.797   22.41   21.84

[plot of the timing results: "Sorting"]

So data.table is the clear winner. Using a key is faster than ordering, and has nicer syntax as well, I'd argue. Thanks for the help, everyone.
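Since `v` varies between runs, one way to wrap this up is with `setkeyv()`, which takes the key column name as a character string. A rough sketch (the helper name `sort_on` and the object `fdt_full` are just illustrative, assuming the one-off conversion of `full` to a data.table suggested in the comments):

require(data.table)
fdt_full <- as.data.table(full)  # convert once, reuse for every v

sort_on <- function(dt, v) {
  cols <- c(names(dt)[v], names(dt)[ncol(dt)])  # column v plus the last column
  a <- dt[, cols, with = FALSE]                 # copy just the two columns
  setkeyv(a, cols[1])                           # sort by the chosen column, by reference
  a
}

head(sort_on(fdt_full, 1))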

Fist
  • In your previous benchmark, `ord_dt` and `ord_mat` were roughly equivalent while now there is a huge difference. I'll be curious to find out what contributed most between upgrading your R version and packages, or removing the data class conversion from your computations. – flodel Jul 19 '12 at 23:55
  • isn't `ord_df` still implemented incorrectly, with `[` instead of `[[` sub-setting? – Martin Morgan Jul 20 '12 at 02:30
  • @flodel I was using 200,000 rows before, and 10 million now. There probably wasn't enough data to differentiate between the two methods. Also I was using a fairly old version of data.table. I don't think the class conversion had much influence. Converting is insignificant (`as.data.table()` takes less than 100 milliseconds on `full`, not much out of the 2-3 seconds needed overall). – Fist Jul 20 '12 at 08:30
  • @MartinMorgan `order(fdf[[1]])` and `sort.list(fdf[[1]])` are equivalent (in time) so there was no need to include both. – Fist Jul 20 '12 at 08:53
  • +1 This looks more like I'd expect. Thanks for upgrading and reporting new results so quickly. Btw, if you increase the number of columns, too, the speed gap should widen further. That's because once the sort order is determined, `setkey` reorders the other columns by reference, reusing a small amount of working memory. Whereas the other methods copy the whole table. With 10 million rows and (just) 2 columns it's only 1e7 * 2 * 8 / 1024^3 = 0.15 GB. Once object size gets significantly into the many GB (say 20GB in 64bit), that's what we mean by large. – Matt Dowle Jul 20 '12 at 10:51
  • @Fist yes but `order(fdf[1])` is a misguided implementation and implies somehow that `order` and `sort.list` have different performance characteristics, which in this case they don't. But whatever. – Martin Morgan Jul 20 '12 at 11:44