21

What is the R equivalent of rank functions like Oracle's ROW_NUMBER(), RANK(), or DENSE_RANK() ("assign integer values to the rows depending on their order"; see http://www.orafaq.com/node/55)?

I agree that the functionality of each of these functions can be replicated in an ad hoc fashion, but my main concern is performance. It would be good to avoid joins or indexed access, for the sake of both memory and speed.

Wei

  • Perhaps if you described what those procedures do, others not familiar with them can help you out as well. I suspect this is why you're getting the downvotes. Try expanding your question. – Roman Luštrik Jul 12 '12 at 07:24
  • I believe the poster is referring to Oracle's analytic functions: http://www.orafaq.com/node/55. – Tommy O'Dell Jul 12 '12 at 08:10

5 Answers

30

The data.table package, especially from version 1.8.1 onward, offers much of the functionality of PARTITION BY in SQL terms. rank(x, ties.method = "min") in R is similar to Oracle's RANK(), and there's a way using factors (described below) to mimic DENSE_RANK(). A way to mimic ROW_NUMBER() should be obvious by the end (and is sketched explicitly below).

Here's an example. First, load the latest version of data.table from R-Forge:

install.packages("data.table",
  repos= c("http://R-Forge.R-project.org", getOption("repos")))

library(data.table)

Create some example data:

set.seed(10)

DT<-data.table(ID=seq_len(4*3),group=rep(1:4,each=3),value=rnorm(4*3),
  info=c(sample(c("a","b"),4*2,replace=TRUE),
  sample(c("c","d"),4,replace=TRUE)),key="ID")

> DT
    ID group       value info
 1:  1     1  0.01874617    a
 2:  2     1 -0.18425254    b
 3:  3     1 -1.37133055    b
 4:  4     2 -0.59916772    a
 5:  5     2  0.29454513    b
 6:  6     2  0.38979430    a
 7:  7     3 -1.20807618    b
 8:  8     3 -0.36367602    a
 9:  9     3 -1.62667268    c
10: 10     4 -0.25647839    d
11: 11     4  1.10177950    c
12: 12     4  0.75578151    d

Rank each ID by decreasing value within group (note the - in front of value to denote decreasing order):

> DT[,valRank:=rank(-value),by="group"]
    ID group       value info valRank
 1:  1     1  0.01874617    a       1
 2:  2     1 -0.18425254    b       2
 3:  3     1 -1.37133055    b       3
 4:  4     2 -0.59916772    a       3
 5:  5     2  0.29454513    b       2
 6:  6     2  0.38979430    a       1
 7:  7     3 -1.20807618    b       2
 8:  8     3 -0.36367602    a       1
 9:  9     3 -1.62667268    c       3
10: 10     4 -0.25647839    d       3
11: 11     4  1.10177950    c       1
12: 12     4  0.75578151    d       2

For DENSE_RANK() with ties in the ranked value, you can convert the value to a factor and then return the underlying integer codes. For example, ranking each ID based on info within group (compare infoRank with infoRankDense):

DT[,infoRank:=rank(info,ties.method="min"),by="group"]
DT[,infoRankDense:=as.integer(factor(info)),by="group"]

> DT
    ID group       value info valRank infoRank infoRankDense
 1:  1     1  0.01874617    a       1        1             1
 2:  2     1 -0.18425254    b       2        2             2
 3:  3     1 -1.37133055    b       3        2             2
 4:  4     2 -0.59916772    a       3        1             1
 5:  5     2  0.29454513    b       2        3             2
 6:  6     2  0.38979430    a       1        1             1
 7:  7     3 -1.20807618    b       2        2             2
 8:  8     3 -0.36367602    a       1        1             1
 9:  9     3 -1.62667268    c       3        3             3
10: 10     4 -0.25647839    d       3        2             2
11: 11     4  1.10177950    c       1        1             1
12: 12     4  0.75578151    d       2        2             2
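
A caveat on the factor trick (my note, not from the original answer): as.integer(factor(x)) assigns codes in the sorted order of the levels, so the dense rank always runs in increasing order. For a decreasing dense rank you could reverse the levels (infoRankDenseDesc is a made-up column name):

# Hypothetical: decreasing dense rank of info within each group.
# Reversing the factor levels makes the "largest" value get code 1.
DT[, infoRankDenseDesc := as.integer(factor(info, levels = rev(sort(unique(info))))), by = "group"]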

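To make the ROW_NUMBER() analogue explicit (a sketch of my own, not spelled out in the original answer; rowNum is a made-up column name):

# ROW_NUMBER() OVER (PARTITION BY group ORDER BY value DESC):
# unique, gapless ranks, with ties broken by order of appearance.
DT[, rowNum := rank(-value, ties.method = "first"), by = "group"]

# If DT is already sorted in the desired order, a plain counter does the same:
DT[, rowNum := seq_len(.N), by = "group"]
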
p.s. Hi Matthew Dowle.


LEAD and LAG

For imitating LEAD and LAG, start with the answer linked in the comments below (http://stackoverflow.com/a/10708124/1281189). I would create a rank variable based on the order of the IDs within groups. This wouldn't be necessary with fake data like the above, but if the IDs are not in sequential order within groups, it makes life a bit more difficult. So here's some new fake data with non-sequential IDs:

set.seed(10)

DT<-data.table(ID=sample(seq_len(4*3)),group=rep(1:4,each=3),value=rnorm(4*3),
  info=c(sample(c("a","b"),4*2,replace=TRUE),
  sample(c("c","d"),4,replace=TRUE)),key="ID")

DT[,idRank:=rank(ID),by="group"]
setkey(DT,group, idRank)

> DT
    ID group       value info idRank
 1:  4     1 -0.36367602    b      1
 2:  5     1 -1.62667268    b      2
 3:  7     1 -1.20807618    b      3
 4:  1     2  1.10177950    a      1
 5:  2     2  0.75578151    a      2
 6: 12     2 -0.25647839    b      3
 7:  3     3  0.74139013    c      1
 8:  6     3  0.98744470    b      2
 9:  9     3 -0.23823356    a      3
10:  8     4 -0.19515038    c      1
11: 10     4  0.08934727    c      2
12: 11     4 -0.95494386    c      3

Then, to get the values from the previous record, join on the group and idRank variables, subtracting 1 from idRank, and use the mult = 'last' argument. To get the value from the record two rows above, subtract 2.

DT[,prev:=DT[J(group,idRank-1), value, mult='last']]
DT[,prev2:=DT[J(group,idRank-2), value, mult='last']]

    ID group       value info idRank        prev      prev2
 1:  4     1 -0.36367602    b      1          NA         NA
 2:  5     1 -1.62667268    b      2 -0.36367602         NA
 3:  7     1 -1.20807618    b      3 -1.62667268 -0.3636760
 4:  1     2  1.10177950    a      1          NA         NA
 5:  2     2  0.75578151    a      2  1.10177950         NA
 6: 12     2 -0.25647839    b      3  0.75578151  1.1017795
 7:  3     3  0.74139013    c      1          NA         NA
 8:  6     3  0.98744470    b      2  0.74139013         NA
 9:  9     3 -0.23823356    a      3  0.98744470  0.7413901
10:  8     4 -0.19515038    c      1          NA         NA
11: 10     4  0.08934727    c      2 -0.19515038         NA
12: 11     4 -0.95494386    c      3  0.08934727 -0.1951504

For LEAD, add the appropriate offset to the idRank variable and switch to mult = 'first':

DT[,nex:=DT[J(group,idRank+1), value, mult='first']]
DT[,nex2:=DT[J(group,idRank+2), value, mult='first']]

    ID group       value info idRank        prev      prev2         nex       nex2
 1:  4     1 -0.36367602    b      1          NA         NA -1.62667268 -1.2080762
 2:  5     1 -1.62667268    b      2 -0.36367602         NA -1.20807618         NA
 3:  7     1 -1.20807618    b      3 -1.62667268 -0.3636760          NA         NA
 4:  1     2  1.10177950    a      1          NA         NA  0.75578151 -0.2564784
 5:  2     2  0.75578151    a      2  1.10177950         NA -0.25647839         NA
 6: 12     2 -0.25647839    b      3  0.75578151  1.1017795          NA         NA
 7:  3     3  0.74139013    c      1          NA         NA  0.98744470 -0.2382336
 8:  6     3  0.98744470    b      2  0.74139013         NA -0.23823356         NA
 9:  9     3 -0.23823356    a      3  0.98744470  0.7413901          NA         NA
10:  8     4 -0.19515038    c      1          NA         NA  0.08934727 -0.9549439
11: 10     4  0.08934727    c      2 -0.19515038         NA -0.95494386         NA
12: 11     4 -0.95494386    c      3  0.08934727 -0.1951504          NA         NA
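
As a side note (my addition, not part of the original answer): data.table 1.9.6 and later provide shift(), which implements LAG and LEAD directly and avoids the self-join above (prevS and nexS are made-up column names):

# LAG and LEAD by one row within each group (requires data.table >= 1.9.6):
DT[, prevS := shift(value, n = 1, type = "lag"), by = group]
DT[, nexS := shift(value, n = 1, type = "lead"), by = group]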
BenBarnes

  • Hi Ben. This answer deserves more than just my +1! – Matt Dowle Jul 12 '12 at 20:11
  • The := with by operator is amazing. It would be good if the Oracle LEAD and LAG functions were implemented in a future release of data.table. By the way, it seems that you need R 2.15 to install directly from the R-Forge repository. – Wei Jul 13 '12 at 01:31
  • @Wei, for an example of mimicking LAG with data.table, you could see [this answer](http://stackoverflow.com/a/10708124/1281189), in particular the part adding the PREV column. – BenBarnes Jul 13 '12 at 06:12
  • @Wei, also, to install package binaries from R-Forge, you'll need the most up-to-date version of R, but you should be able to install packages from source and load them using the minimum R version required by the package itself. – BenBarnes Jul 13 '12 at 06:19
  • @BenBarnes I find the post you referred to a bit difficult to understand, for it involves another requirement. I believe the gist of that solution is as follows: `setkey(DT,group, ID); DT[,prev:=DT[J(group,ID-1), value, mult='last']]; DT[,nex:=DT[J(group,ID+1),value, mult='first']]` Of course, the ID column should be added if it is not already there. Changing -1 to -2 will get LAG by 2. – Wei Jul 17 '12 at 06:39
  • Hi @Wei, please see my edits above. You are correct with having to create an appropriate ID variable, and had the right gist. – BenBarnes Jul 19 '12 at 13:03
7

From data.table v1.9.5+, the function frank() (for fast rank) is implemented. frank() is useful in interactive scenarios, whereas frankv() makes it easy to program with.

It implements every operation available in base::rank, and offers these additional advantages:

  • frank() operates on lists, data.frames and data.tables, in addition to atomic vectors.

  • We can specify, for each column, whether the rank should be computed in increasing or decreasing order.

  • It also implements the rank type dense, in addition to the types available in base.

  • You can use - on a character column as well, to rank in decreasing order.

Here's an illustration of all the above points using the same data.table DT from @BenBarnes' (excellent) post.

data:

require(data.table)
set.seed(10)
sample_n <- function(x, n) sample(x, n, replace=TRUE)
DT <- data.table(
        ID = seq_len(4*3),
        group = rep(1:4,each=3),
        value = rnorm(4*3),
        info = c(sample_n(letters[1:2], 8), sample_n(letters[3:4], 4)))

On single columns:

  • Compute dense rank:

    DT[, rank := frank(value, ties.method="dense"), by=group]
    

You can also use the other methods min, max, random, average and first. (A ROW_NUMBER()-style use of first is sketched after this list.)

  • In decreasing order:

    DT[, rank := frank(-value, ties.method="dense"), by=group]
    
  • Using frankv, similar to frank:

    # increasing order
    frankv(DT, "value", ties.method="dense")
    
    # decreasing order
    frankv(DT, "value", order=-1L, ties.method="dense")
    
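The ROW_NUMBER()-style use of first mentioned above (my addition, not part of the original answer): ties.method = "first" breaks ties by order of appearance, so the ranks within each group are unique and gapless (rn is a made-up column name):

    # ROW_NUMBER()-like: unique, gapless ranks; ties broken by row order
    DT[, rn := frank(value, ties.method = "first"), by = group]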

On multiple columns

You can use .SD, which stands for Subset of Data and contains the data corresponding to that group. See the Introduction to data.table HTML vignette for more on .SD.

  • Rank by info, value columns while grouping by group:

    DT[, rank := frank(.SD,  info, value, ties.method="dense"), by=group]
    
  • Use - to specify decreasing order:

    DT[, rank := frank(.SD,  info, -value, ties.method="dense"), by=group]
    
  • You can also use - directly on character columns

    DT[, rank := frank(.SD, -info, -value, ties.method="dense"), by=group]
    

You can use frankv() similarly, providing the columns via the cols argument and the sort direction for each column via the order argument.
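
For instance (a sketch of that frankv() call, using the column names of the example DT above; this is my addition, not part of the original answer):

# info increasing, value decreasing, dense ranks over the whole table
frankv(DT, cols = c("info", "value"), order = c(1L, -1L), ties.method = "dense")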


Small benchmark to compare with base::rank:

set.seed(45L)
x = sample(1e4, 1e7, TRUE)
system.time(ans1 <- base::rank(x, ties.method="first"))
#    user  system elapsed 
#  22.200   0.255  22.536 
system.time(ans2 <- frank(x, ties.method="first"))
#    user  system elapsed 
#   0.745   0.014   0.762 
identical(ans1, ans2) # [1] TRUE
Arun
3

I like data.table as much as the next guy, but it isn't always necessary. data.table will always be faster, but even for moderately large data sets, if the number of groups is fairly small, plyr will still perform adequately.

What BenBarnes did using data.table can be done just as compactly (though, as I noted above, probably more slowly in many cases) using plyr:

library(plyr)                
ddply(DT,.(group),transform,valRank = rank(-value))
ddply(DT,.(group),transform,valRank = rank(info,ties.method = "min"),
                            valRankDense = as.integer(factor(info)))

and even without loading a single extra package at all:

do.call(rbind,by(DT,DT$group,transform,valRank = rank(-value)))
do.call(rbind,by(DT,DT$group,transform,valRank = rank(info,ties.method = "min"),
                                        valRankDense = as.integer(factor(info))))

although you do lose some of the syntactic niceties in that last case.
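
For a single ranked column, base R's ave() is another package-free option (my addition, not part of the original answer):

# Rank by decreasing value within each group; returns a vector aligned
# with the rows of DT.
DT$valRank <- ave(-DT$value, DT$group, FUN = rank)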

joran
  • `transform` by group is very, very slow, however you do it. It scales so badly that I felt I had to comment. – Matt Dowle Jul 14 '12 at 00:26
  • @MatthewDowle Fair enough. Though I guess I feel like I should comment that not everyone handles data large enough for the speed difference to be meaningful. – joran Jul 14 '12 at 04:55
  • Fair enough too. Just added a (straw man) benchmark that attempts to quantify where the difference starts to become significant. See what you think. – Matt Dowle Jul 20 '12 at 03:01
  • @MatthewDowle The next time I have to do a group by transformation on data where each group consists of only 2 rows, I'll be sure to turn to data.table. Here in the real world, though, where 500 groups in a 1e6 data set is a lot for me, I think I'll be just fine with my plyr and base code which runs in well under a second. (Compared to essentially instantaneously for data.table, to be sure, but still...) – joran Jul 20 '12 at 03:48
  • @MatthewDowle I struggled with how to respond to your edit. First, I felt like your edit was arguing _against_ the points in my answer, and so it really belonged in a separate answer itself, rather than in mine. Second, I disagree strongly that that was a useful benchmark. As I mentioned above, your example considered grouping variables with hundreds or thousands of levels. That's hardly realistic in most cases. If you change your example to have 50 groups each time the differences are on the order of 0.1 seconds. – joran Aug 16 '12 at 14:00
  • The edit was about transform-by-group (only), but the question is much more general. So yes I was responding just to your answer, and it was too big for a comment. I don't think many-small-groups is _that_ uncommon. Grouping a set of time series by minute or by hour, for example. Pairs of rows does crop up quite a lot, too, where the 'group' is an event followed by another event, for example. That the benchmark isn't realistic for you is fine, but it might be realisic for others. It certainly is for me. – Matt Dowle Aug 16 '12 at 14:17
  • @MatthewDowle Fair enough. I edited my answer slightly to address our disagreement (but I don't actually think we're disagreeing all that much...) – joran Aug 16 '12 at 14:26
  • Linking to [this question](http://stackoverflow.com/questions/12539248/calculate-the-difference-betwen-pairs-of-consecutive-rows-in-a-data-frame-r) where pairs of rows seems relevant. – Matt Dowle Sep 26 '12 at 15:30
1

dplyr now has window functions, including row_number() and dense_rank(): https://dplyr.tidyverse.org/reference/ranking.html

df <- tibble::tribble(
  ~subjects,        ~date, ~visits,
         1L, "21/09/1999",      2L,
         1L, "29/04/1999",      4L,
         2L, "18/02/1999",     15L,
         3L, "10/07/1999",     13L,
         4L, "27/08/1999",      7L,
         7L, "27/10/1999",     14L,
        10L, "18/04/1999",      8L,
        13L, "27/09/1999",     14L,
        14L, "15/09/1999",      6L,
        16L, "27/11/1999",     14L,
        20L, "06/02/1999",      4L,
        22L, "07/09/1999",     12L,
        23L, "24/03/1999",     14L,
        24L, "19/01/1999",      7L
)

Note that, unlike in the SQL ROW_NUMBER() code, ORDER BY does not need to be stipulated separately; the ordering column is passed directly to row_number().

df_partition <- df %>%
  group_by(subjects) %>% # group_by is equivalent to PARTITION BY in the SQL
  # ROW_NUMBER() equivalents:
  mutate(rn = row_number(visits),
         rn_reversed = row_number(desc(visits))) %>%
  ungroup() %>% # grouping by subjects remains on the data unless removed like this
  mutate(dense_rank = dense_rank(visits))
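
For the SQL RANK() analogue (my addition, not part of the original answer), dplyr also provides min_rank(), which gives tied rows the same rank and leaves gaps afterwards (visits_rank is a made-up column name):

df %>%
  group_by(subjects) %>%
  mutate(visits_rank = min_rank(visits)) %>% # like RANK() OVER (PARTITION BY subjects ORDER BY visits)
  ungroup()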
Zoë Turner
-6

I don't think there's a direct equivalent to Oracle's analytic functions. plyr will likely be able to achieve some of the analytic functions, but not all of them directly. I'm sure R can replicate each function separately, but I don't think there's a single package that will do it all.

If there's a specific operation you need to achieve in R, then do some googling, and if you come up empty, ask a specific question here on StackOverflow.

Tommy O'Dell