60

I have a data.table with about 3 million rows and 40 columns. I would like to sort this table by descending order within groups like the following sql mock code:

sort by ascending Year, ascending MemberID, descending Month 

Is there an equivalent way in data.table to do this? So far I have to break it down into 2 steps:

setkey(X, Year, MemberID)

This is very fast and takes only a few second.

X <- X[,.SD[order(-Month)],by=list(Year, MemberID)]

This step takes so much longer (5 minutes).

Update: Someone made a comment to do X <- X[sort(Year, MemberID, -Month)] and later deleted. This approach seems to be much faster:

user  system elapsed 
5.560  11.242  66.236 

My approach: setkey() then order(-Month)

   user  system elapsed 
816.144   9.648 848.798 

My question is now: if I want to summarize by Year, MemberId and Month after sort(Year, MemberID, Month), does data.table recognize the sort order?

Update 2: to response to Matthew Dowle:

After setkey with Year, MemberID and Month, I still have multiple records per group. What I would like is to summarize for each of the groups. What I meant was: if I use X[order(Year, MemberID, Month)], does the summation utilizes binary search functionality of data.table:

monthly.X <- X[, lapply(.SD[], sum), by = list(Year, MemberID, Month)]

Update 3: Matthew D proposed several approaches. Run time for the first approach is faster than order() approach:

   user  system elapsed 
  7.910   7.750  53.916 

Matthew: what surprised me was converting the sign of Month takes most of the time. Without it, setkey is blazing fast.

David Arenburg
  • 91,361
  • 17
  • 137
  • 196
AdamNYC
  • 19,887
  • 29
  • 98
  • 154

2 Answers2

87

Update June 5 2014:

The current development version of data.table v1.9.3 has two new functions implemented, namely: setorder and setorderv, which does exactly what you require. These functions reorder the data.table by reference with the option to choose either ascending or descending order on each column to order by. Check out ?setorder for more info.

In addition, DT[order(.)] is also by default optimised to use data.table's internal fast order instead of base:::order. This, unlike setorder, will make an entire copy of the data, and is therefore less memory efficient, but will still be orders of magnitude faster than operating using base's order.

Benchmarks:

Here's an illustration on the speed differences using setorder, data.table's internal fast order and with base:::order:

require(data.table) ## 1.9.3
set.seed(1L)
DT <- data.table(Year     = sample(1950:2000, 3e6, TRUE), 
                 memberID = sample(paste0("V", 1:1e4), 3e6, TRUE), 
                 month    = sample(12, 3e6, TRUE))

## using base:::order
system.time(ans1 <- DT[base:::order(Year, memberID, -month)])
#   user  system elapsed 
# 76.909   0.262  81.266 

## optimised to use data.table's fast order
system.time(ans2 <- DT[order(Year, memberID, -month)])
#   user  system elapsed 
#  0.985   0.030   1.027

## reorders by reference
system.time(setorder(DT, Year, memberID, -month))
#   user  system elapsed 
#  0.585   0.013   0.600 

## or alternatively
## setorderv(DT, c("Year", "memberID", "month"), c(1,1,-1))

## are they equal?
identical(ans2, DT)    # [1] TRUE
identical(ans1, ans2)  # [1] TRUE

On this data, benchmarks indicate that data.table's order is about ~79x faster than base:::order and setorder is ~135x faster than base:::order here.

data.table always sorts/orders in C-locale. If you should require to order in another locale, only then do you need to resort to using DT[base:::order(.)].

All these new optimisations and functions together constitute FR #2405. bit64::integer64 support also has been added.


NOTE: Please refer to the history/revisions for earlier answer and updates.

Arun
  • 116,683
  • 26
  • 284
  • 387
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • @AdamNYC NP. You seem to writing `sort` when you mean `order`? Binary search is to do with `i` not `by`, so I'm not sure what you mean in the edit. I don't think it will be slow, and what you're doing seems correct. – Matt Dowle Dec 03 '12 at 16:10
  • Thanks, Matthew. I corrected my update. Your explanation is clear. – AdamNYC Dec 03 '12 at 16:12
  • @AdamNYC Ok but 66s feels slow. Do try option 1, just for completeness. – Matt Dowle Dec 03 '12 at 16:15
  • Hi Matthew, I updated my question per your suggestion. BTW, I can't thank you enough for such a terrific package. – AdamNYC Dec 03 '12 at 17:08
  • @AdamNYC Hm. That 53s looks odd, given the 7s user time. How did you time that and what exactly did you time? A vector 3 million long should change sign quicker than that. – Matt Dowle Dec 03 '12 at 17:33
  • Hi Matt, here is the code I use. My table has 120 variables though `> system.time({ c[, Month := -Month]; setkey(c, Year, MemberID, Month); c[, Month := -Month] } ) user system elapsed 7.910 7.750 53.916` – AdamNYC Dec 03 '12 at 17:37
  • @AdamNYC thanks but that doesn't tally with what you wrote as edit to question. How much is most? – Matt Dowle Dec 03 '12 at 18:27
  • Hi Matt, I'll time the assignment call shortly. I ran it before and it feels long. :-) – AdamNYC Dec 03 '12 at 18:30
  • Sorry, my bio-clock didn't work: here is the breakdown of each step:` reverse sign: user system elapsed 0.021 0.019 0.059 setkey: user system elapsed 7.842 7.411 48.878 ` – AdamNYC Dec 03 '12 at 18:39
  • @AdamNYC I might have an inkling why that is. Have raised [FR#2419 fastorder should iterate left to right](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2419&group_id=240&atid=978) to remind me to come back to it. Glad it wasn't the sign change after all, though. – Matt Dowle Dec 03 '12 at 21:09
  • Thanks a lot, Matthew. I am totally converted to data.table now. – AdamNYC Dec 03 '12 at 21:24
  • 1
    @MatthewDowle Would it be possible to add the ability to set the sort order as descending in `setkey`? For example, option 1.b would simply be `setkey(X,Year,MemberID,-Month)`. – dnlbrky Apr 23 '13 at 13:02
  • @dnlbrky `setkey` is ascending order always I'm afraid and that's set in stone (relied on in binary search to keep things manageable internally). I was just about to suggest `data.table:::fastorder` but that takes column names not `-Month` unfortunately. – Matt Dowle Apr 23 '13 at 13:10
  • @MattDowle Is this answer still up-to-date? There does not seem to be much difference between `base:::order()` and `data.table:::forderv()` today. Am I wrong? (Tested on vectors, not on data frames.) – Arthur May 01 '20 at 12:09
  • @arthur The answer indeed needs updating. data.table's sort code was included in R 3.3.0 released May 2016. – Matt Dowle Jul 14 '22 at 17:22
16

The comment was mine, so I'll post the answer. I removed it because I couldn't test whether it was equivalent to what you already had. Glad to hear it's faster.

X <- X[order(Year, MemberID, -Month)]

Summarizing shouldn't depend on the order of your rows.

Matthew Plourde
  • 43,932
  • 7
  • 96
  • 113