Rank by two columns and keep ties

Question

My question is continuation of this probem Link

I have a dataset such as this one:

 ID    |     Date 

  A        01/01/2015
  A        02/01/2015
  A        02/01/2015
  A        02/01/2015
  A        05/01/2015     
  B        01/01/2015

I want to rank each date by a referential date - 31/01/2015. The closest date to the referential date being ranked 1, second 2, and so on. The result would look like:

  ID    |     Date           |  Sequence

  A        01/01/2015           3
  A        02/01/2015           2
  A        02/01/2015           2
  A        02/01/2015           2
  A        05/01/2015           1  
  B        01/01/2015          ...

While the rank function does think, I also want to keep all the ties. How do I do that?

Also, I am working with a huge dataset - approx. 300 million rows. So the solution would ideally be fast.

Fyi, frank + ties.method="dense" has the analogue dense_rank in dplyr. — Frank, Jan 03 '17 at 17:42
Possible duplicate of [How I can create a new ties.method with the R rank() function?](http://stackoverflow.com/questions/3197999/how-i-can-create-a-new-ties-method-with-the-r-rank-function) — Barker, Jan 03 '17 at 19:00

score 2 · Accepted Answer · answered Jan 03 '17 at 16:59

We can use frank from data.table with dense as ties.method after grouping by 'ID' on the absolute difference between the 'Date' and the reference date ('2015-01-31')

library(data.table)
setDT(df)[, Sequence := frank(abs(as.IDate(Date, "%d/%m/%Y")- 
              as.IDate("2015-01-31")), ties.method = "dense"), by = ID]
df
#    ID       Date Sequence
#1:  A 01/01/2015        3
#2:  A 02/01/2015        2
#3:  A 02/01/2015        2
#4:  A 02/01/2015        2
#5:  A 05/01/2015        1
#6:  B 01/01/2015        1

data

df <- structure(list(ID = c("A", "A", "A", "A", "A", "B"), Date = c("01/01/2015", 
 "02/01/2015", "02/01/2015", "02/01/2015", "05/01/2015", "01/01/2015"
)), .Names = c("ID", "Date"), class = "data.frame", row.names = c(NA, 
-6L))

Thank you. Works great. I wasnt actually aware of the frank function. — Prometheus, Jan 03 '17 at 17:11
Just a quick follow-up. Could we do something like this: setDT(df)[, Sequence := frank(Date, ties.method = "random"), by = ID] to speed up the processing? It works perfectly with smaller objects. But with the huge ones it doesnt move anyone. — Prometheus, Jan 04 '17 at 17:11

score 2 · Answer 2 · answered Jan 03 '17 at 17:43

With dplyr dense_rank:

library(dplyr)
df$Sequence <- dense_rank(as.numeric(as.Date('31/01/2015', '%d/%m/%Y') - as.Date(df$Date, '%d/%m/%Y')))
head(df) 

  ID       Date Sequence
1  A 01/01/2015        3
2  A 02/01/2015        2
3  A 02/01/2015        2
4  A 02/01/2015        2
5  A 05/01/2015        1
6  B 01/01/2015        3

lmo · Answer 3 · 2017-01-03T18:33:52.470

1

Here is a data.table method that would work.

rleid returns "IDs" for the same date by group ID. However, these IDs count up from 0. In the second chain, [, (max(var) - var) + 1L reverses these date IDs for each ID group.

df[, var:=rleid(Date), by=ID][, var := (max(var) - var) + 1L, by=ID]
df
   ID       Date var
1:  A 01/01/2015   3
2:  A 02/01/2015   2
3:  A 02/01/2015   2
4:  A 02/01/2015   2
5:  A 05/01/2015   1
6:  B 01/01/2015   1

edited Jan 03 '17 at 18:33

answered Jan 03 '17 at 16:49

lmo

37,904
9
56
69

Thank you for the response. I was thinking of something like this: var := (max(var) - var) + 1L, by=ID, but I wasnt sure how to implement it. – Prometheus Jan 03 '17 at 17:10

score 1 · Answer 4 · edited May 23 '17 at 10:30

Base R solution. First get your days and the target date by converting them to Date objects and taking the absolute value of the difference.

timediff <- abs(as.Date(df[["Date"]], format = "%d/%m/%Y") - as.Date("2015-01-31"))

Next we can use rank to get the order of them. We can use any ties.method that produces a single value for the ties, but "min" or "max" would probably be best since they output integers.

diffrank <- rank(timediff, ties.method = "min")

Finally we can use this solution for reordering the rank to remove gaps between instances.

df[["Sequence"]] <- as.numeric(factor(diffrank))

If you would like, this can all be done in one line:

df[["Sequence"]] <- as.numeric(factor(rank(
                        abs(as.Date(df[["Date"]], format = "%d/%m/%Y") - 
                               as.Date("2015-01-31")), ties.method = "min")))

I will try it on a smaller dataset. But I think it would be somewhat slower than the data.table solution. — Prometheus, Jan 04 '17 at 08:59

Rank by two columns and keep ties

4 Answers4

data

Linked