6

Imagine that I have a list of numbers (i.e. numbers column in data.table/data.frame).

1
5
5
10
11
12

for each number in a list a want to count how many unique numbers are there which are lower than that particular number + 5.

The explanation for upper case, first number = 1, search range is 1+5 = 6, so three numbers are in range, less than or equal to: c(1,5,5), and then count unique is 2. This is all by assuming we've got the additional condition, that the number must not only be lower than current_number + 5, but also its index in the list must be >= that of current_number.

The result in this case would be:

2
2
2
3
2
1

Note: Is there a fast solution for huge dataset, in data.frame or data.table? My dataset is rather huge, 10+M rows.

Makaroni
  • 880
  • 3
  • 15
  • 34
  • 2
    I don't get the logic. What does search range mean here? For the first element (`1`) the comparison value is `6`. 3 unique values in the vector are higher than 6, but your result is `2`. Where do I misunderstand you? – Georgery Feb 14 '20 at 14:38
  • 3
    Clarify how you are getting 3 on 4th row? – zx8754 Feb 14 '20 at 14:40
  • 1
    Shouldn't it supposed to be `2, 2, 2, 5, 5, 5`?...Also I think you mean less than..? because `1, 5, 5` are less than 6..? I don't know. I m confused – Sotos Feb 14 '20 at 14:41
  • @Georgery I meant lower, sorry! – Makaroni Feb 14 '20 at 14:44
  • 2
    After the "lower" edit, let's look at the last number, 12. 12 + 5 = 17. All the numbers are *lower than* 17, so the set is `{1, 5, 5, 10, 11, 12}`, and the number of unique numbers in that set is 5. Why is your desired result 2? Perhaps, the search range is `[x, x+5]` rather than `(-Inf, x+5]`? Also, please clarify by "lower than" if you mean `<` or `<=`. – Gregor Thomas Feb 14 '20 at 14:45
  • @Gregor-reinstateMonica, I didn't explained it well. :( when you calculate this for one number, it is dropped from the database. It's like you have an apply over data.table. – Makaroni Feb 14 '20 at 14:47
  • Oh, so it's counting how many numbers *after x in the input* are `<` (or maybe `<=`) `x + 5`? – Gregor Thomas Feb 14 '20 at 14:48
  • @Gregor-reinstateMonica, Yes <=(x+5). Sorry for the confusion. – Makaroni Feb 14 '20 at 14:49

4 Answers4

7

The fastest way I can think of in base R (works if x is sorted):

findInterval(x + 5, unique(x)) - cumsum(!duplicated(x)) + 1L
#[1] 2 2 2 3 2 1

edit: no problem with the sorting because with data.table, sorting integers is trivial:

nr <- 1e7
nn <- nr/2
set.seed(0L)
DT <- data.table(X=sample(nn, nr, TRUE))
#DT <- data.table(X=c(1,5,5,10,11,12))

system.time(
    DT[order(X), 
        COUNT := findInterval(X + 5L, unique(X)) - cumsum(!duplicated(X)) + 1L
    ]
)
#   user  system elapsed 
#   1.73    0.17    1.53 

2s for 10million rows.

chinsoon12
  • 25,005
  • 4
  • 25
  • 35
nicola
  • 24,005
  • 3
  • 35
  • 56
6

Try this:

x <- c(1,5,5,10,11,12)

sapply(seq_along(x), function(i)
  sum(unique(x[i:length(x)]) <= (x[i] + 5)))
# [1] 2 2 2 3 2 1
zx8754
  • 52,746
  • 12
  • 114
  • 209
2

One option is to use a sql self-join

library(sqldf)


df$r <- seq(nrow(df))

sqldf('
select    a.V1
          , count(distinct b.V1) as n
from      df a
          left join df b
            on  b.V1 <= a.V1 + 5
                and b.r >= a.r
group by  a.r
')

#   V1 n
# 1  1 2
# 2  5 2
# 3  5 2
# 4 10 3
# 5 11 2
# 6 12 1

Data used:

df <- structure(list(V1 = c(1L, 5L, 5L, 10L, 11L, 12L)), row.names = c(NA, 
-6L), class = "data.frame")
IceCreamToucan
  • 28,083
  • 2
  • 22
  • 38
0
sapply(yourVector + 5, function(x, y) sum(x > y), y = unique(x))
Georgery
  • 7,643
  • 1
  • 19
  • 52
  • 1
    As far as I get this is the logic you are describing. However, your expected result shows something else. Please, clarify. – Georgery Feb 14 '20 at 14:47