Count numbers that are lower by constant from current number

Question

Imagine that I have a list of numbers (i.e. numbers column in data.table/data.frame).

for each number in a list a want to count how many unique numbers are there which are lower than that particular number + 5.

The explanation for upper case, first number = 1, search range is 1+5 = 6, so three numbers are in range, less than or equal to: c(1,5,5), and then count unique is 2. This is all by assuming we've got the additional condition, that the number must not only be lower than current_number + 5, but also its index in the list must be >= that of current_number.

The result in this case would be:

Note: Is there a fast solution for huge dataset, in data.frame or data.table? My dataset is rather huge, 10+M rows.

I don't get the logic. What does search range mean here? For the first element (`1`) the comparison value is `6`. 3 unique values in the vector are higher than 6, but your result is `2`. Where do I misunderstand you? — Georgery, Feb 14 '20 at 14:38
Shouldn't it supposed to be `2, 2, 2, 5, 5, 5`?...Also I think you mean less than..? because `1, 5, 5` are less than 6..? I don't know. I m confused — Sotos, Feb 14 '20 at 14:41
After the "lower" edit, let's look at the last number, 12. 12 + 5 = 17. All the numbers are *lower than* 17, so the set is `{1, 5, 5, 10, 11, 12}`, and the number of unique numbers in that set is 5. Why is your desired result 2? Perhaps, the search range is `[x, x+5]` rather than `(-Inf, x+5]`? Also, please clarify by "lower than" if you mean `<` or `<=`. — Gregor Thomas, Feb 14 '20 at 14:45
@Gregor-reinstateMonica, I didn't explained it well. :( when you calculate this for one number, it is dropped from the database. It's like you have an apply over data.table. — Makaroni, Feb 14 '20 at 14:47
Oh, so it's counting how many numbers *after x in the input* are `<` (or maybe `<=`) `x + 5`? — Gregor Thomas, Feb 14 '20 at 14:48
@Gregor-reinstateMonica, Yes <=(x+5). Sorry for the confusion. — Makaroni, Feb 14 '20 at 14:49

score 7 · Accepted Answer · edited Feb 14 '20 at 23:32

The fastest way I can think of in base R (works if x is sorted):

findInterval(x + 5, unique(x)) - cumsum(!duplicated(x)) + 1L
#[1] 2 2 2 3 2 1

edit: no problem with the sorting because with data.table, sorting integers is trivial:

nr <- 1e7
nn <- nr/2
set.seed(0L)
DT <- data.table(X=sample(nn, nr, TRUE))
#DT <- data.table(X=c(1,5,5,10,11,12))

system.time(
    DT[order(X), 
        COUNT := findInterval(X + 5L, unique(X)) - cumsum(!duplicated(X)) + 1L
    ]
)
#   user  system elapsed 
#   1.73    0.17    1.53

2s for 10million rows.

score 6 · Answer 2 · answered Feb 14 '20 at 14:59

6

Try this:

x <- c(1,5,5,10,11,12)

sapply(seq_along(x), function(i)
  sum(unique(x[i:length(x)]) <= (x[i] + 5)))
# [1] 2 2 2 3 2 1

answered Feb 14 '20 at 14:59

zx8754

52,746
12
114
209

2

With 10M rows, not sure this is the right solution. Any data.table experts? – zx8754 Feb 14 '20 at 15:17

IceCreamToucan · Answer 3 · 2020-02-14T14:50:36.997

One option is to use a sql self-join

library(sqldf)


df$r <- seq(nrow(df))

sqldf('
select    a.V1
          , count(distinct b.V1) as n
from      df a
          left join df b
            on  b.V1 <= a.V1 + 5
                and b.r >= a.r
group by  a.r
')

#   V1 n
# 1  1 2
# 2  5 2
# 3  5 2
# 4 10 3
# 5 11 2
# 6 12 1

Data used:

df <- structure(list(V1 = c(1L, 5L, 5L, 10L, 11L, 12L)), row.names = c(NA, 
-6L), class = "data.frame")

score 0 · Answer 4 · answered Feb 14 '20 at 14:47

0

sapply(yourVector + 5, function(x, y) sum(x > y), y = unique(x))

answered Feb 14 '20 at 14:47

Georgery

7,643
1
19
52

1

As far as I get this is the logic you are describing. However, your expected result shows something else. Please, clarify. – Georgery Feb 14 '20 at 14:47

Count numbers that are lower by constant from current number

4 Answers4