Find categorical indicator vector based on continuous thresholds

Question

I have a vector t of thresholds that separates my data vector y into length(t) - 1 categories.

y <- runif(100)     # data vector
t <- c(0, 0.5, 1)   # threshold vector

In this example, category 1 corresponds to data points that satisfy 0 < y < 0.5 and category 2 corresponds to data points that satisfy 0.5 < y < 1. To find the corresponding vector of categories, a naive looping approach would be

nc <- length(t) - 1                       # number of categories
categories <- numeric(length=length(y))   # vector of categories

for (cc in 1:nc) {    # loop over categories

  lower <- t[cc]      # lower bound for category cc
  upper <- t[cc + 1]  # upper bound for category cc

  cc.log <- (lower < y) & (y < upper)  # logical vector where y satisfies thresholds
  categories[cc.log] <- cc             # assign active category where thresholds are satisfied

}

Is there an easier and scalable solution that takes as inputs the data vector y as well as the threshold vector t and returns the vector of categories categories?

Edit: Choosing akrun's solution as it is the fastest.

Unit: microseconds
         expr       min         lq       mean     median         uq       max neval
  akrun(y, t)   352.386   357.7325   382.8909   369.4925   380.1840  1295.361   100
 darren(y, t)   520.882   545.2580   600.2583   602.9905   639.5555   886.097   100
 myself(y, t) 11261.807 11415.7625 12403.3405 11653.3235 13218.9600 20399.890   100
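For reference, a benchmark like the one above could be set up roughly as follows. This is only a sketch: the function names mirror the labels in the timing table, but their bodies are assumptions reconstructed from the answers and the loop in the question, and the data size and seed are arbitrary. The output format suggests the microbenchmark package, so the timing step runs only if that package is installed.

```r
# Hypothetical wrappers: names follow the benchmark labels,
# bodies are assumptions based on the code in this thread.
akrun  <- function(y, t) findInterval(y, t)
darren <- function(y, t) cut(y, t, labels = FALSE)
myself <- function(y, t) {
  categories <- numeric(length(y))
  for (cc in seq_len(length(t) - 1)) {
    categories[(t[cc] < y) & (y < t[cc + 1])] <- cc
  }
  categories
}

set.seed(1)          # arbitrary seed, for reproducibility only
y <- runif(1e5)      # data size is an assumption
t <- c(0, 0.5, 1)

# The three approaches agree as long as no value lies exactly on a threshold
agree <- isTRUE(all.equal(akrun(y, t), darren(y, t))) &&
  isTRUE(all.equal(akrun(y, t), myself(y, t)))

# Timing step, skipped when microbenchmark is not available
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    akrun(y, t), darren(y, t), myself(y, t), times = 100
  ))
}
```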

akrun · Accepted Answer · 2020-02-13T19:25:18.923


An easy option is findInterval:

categories2 <- findInterval(y, t)
all.equal(categories, categories2)
#[1] TRUE
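One detail worth checking is what happens to values that fall exactly on a threshold. The loop in the question uses strict inequalities, so such values would keep category 0, whereas findInterval by default treats intervals as closed on the left. The sketch below uses a small made-up vector (edge is an illustrative name, not from the question) to show the default behaviour and the effect of the rightmost.closed and left.open arguments.

```r
t <- c(0, 0.5, 1)
edge <- c(0, 0.25, 0.5, 0.75, 1)   # illustrative values, some exactly on thresholds

# Default: intervals [0, 0.5), [0.5, 1), [1, Inf)
default_bins <- findInterval(edge, t)                           # 1 1 2 2 3

# Treat the last interval as [0.5, 1] so that y == 1 stays in category 2
closed_bins <- findInterval(edge, t, rightmost.closed = TRUE)   # 1 1 2 2 2

# Intervals open on the left: (0, 0.5], (0.5, 1]
open_bins <- findInterval(edge, t, left.open = TRUE)            # 0 1 1 2 2
```

For continuous data such as runif(100), exact ties with a threshold are very unlikely, which is why the comparison above returns TRUE.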


From "how-to-answer": "Not all questions can or should be answered here. Save yourself some frustration and avoid trying to answer questions which... ...have already been asked and answered many times before." https://stackoverflow.com/search?q=user%3A3732271+findInterval – IRTFM Feb 14 '20 at 04:58
@42- Thank you for the comment. But here the question is also about modifying the OP's for loop to find an elegant solution. It is not basically a question about findInterval per se – akrun Feb 14 '20 at 16:29
@42- This comment is applicable to everybody. But, unfortunately, very few follow it. e.g. [here](https://stackoverflow.com/questions/60229489/splitting-list-into-its-components-and-combine-to-form-another-list-in-r/60229508#60229508) or [here](https://stackoverflow.com/questions/60227193/splitting-character-string-to-extract-date-and-time/60227698#60227698) – akrun Feb 14 '20 at 16:33

Darren Tsai · Answer 2 · 2020-02-13T20:13:04.780


Here I provide an alternative to findInterval and compare the two.

cut(y, t, labels = FALSE)

Unlike findInterval, cut returns missing values for data points smaller than the lowest threshold or larger than the highest threshold. E.g.

y <- c(-0.5, runif(5), 1.5)
t <- c(0, 0.5, 1)

cut(y, t, labels = FALSE)
# [1] NA 1 1 2 2 1 NA

findInterval(y, t)
# [1]  0 1 1 2 2 1 3

Of course, the result of cut can be converted to that of findInterval.

cut(y, c(-Inf, t, Inf), labels = FALSE) - 1
# [1]  0 1 1 2 2 1 3
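One caveat on this conversion: by default cut uses intervals closed on the right, while findInterval uses intervals closed on the left, so the two can disagree for values lying exactly on a threshold. A minimal sketch with a made-up vector (edge is an illustrative name) shows the difference and how right = FALSE brings cut in line with the default findInterval result.

```r
t <- c(0, 0.5, 1)
edge <- c(0, 0.25, 0.5, 0.75, 1)   # illustrative values, some exactly on thresholds

# Default cut: intervals (-Inf, 0], (0, 0.5], (0.5, 1], (1, Inf]
right_closed <- cut(edge, c(-Inf, t, Inf), labels = FALSE) - 1                 # 0 1 1 2 2

# right = FALSE: intervals [-Inf, 0), [0, 0.5), [0.5, 1), [1, Inf)
left_closed <- cut(edge, c(-Inf, t, Inf), labels = FALSE, right = FALSE) - 1   # 1 1 2 2 3

# Only the left-closed version matches findInterval on the thresholds themselves
same_as_findInterval <- isTRUE(all.equal(left_closed, findInterval(edge, t)))
```

For data drawn from a continuous distribution the distinction rarely matters, but it can for rounded or discrete values.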

In addition, the documentation for cut says

Instead of cut(*, labels = FALSE), findInterval() is more efficient.


From "how-to-answer": "Not all questions can or should be answered here. Save yourself some frustration and avoid trying to answer questions which... ...have already been asked and answered many times before." – IRTFM Feb 14 '20 at 04:59
