What is the fastest way to look up a large number of values using R?

Question

I have a list of over 1,000,000 numbers and a lookup table that maps ranges of numbers to categories. For example, 0-200 is category A and 201-650 is category B (the ranges are not of equal length).

I need to simply iterate over the list of 1,000,000 numbers and get a list of the 1,000,000 corresponding categories.

EDIT:

For example, the first few elements of my list are 100, 125.5, 807.5, 345.2, and the result should be something like 1, 1, 8, 4 as categories. The mapping logic is implemented in a function, categoryLookup(cd), and I'm using the following command to get the categories:

cats <- sapply(list.cd, categoryLookup)

However, while this runs quickly on lists of up to 10,000 elements, it takes a long time on the whole list.

What is the fastest way to do the same? Is there any form of indexing that can help speed up the process?

Perhaps have a look at `?cut` and its arguments `breaks` and `labels`? — Henrik, Oct 28 '14 at 07:36
To get more specific answers, please be more specific in your question, i.e. post a minimal reproducible example: include a _minimal_ version of your "list" and "lookup table", the desired result, and show the code you have tried. — Henrik, Oct 28 '14 at 07:48
Oh, and are your boundaries integer only? In your example, you give non-overlapping boundaries (ie what happens to 200.5? A, or B?) — Spacedman, Oct 28 '14 at 08:44
Please provide a **minimal, self contained example**. Check these links for general ideas, and how to do it in R: [**here**](http://stackoverflow.com/help/mcve), [**here**](http://www.sscce.org/), [**here**](http://adv-r.had.co.nz/Reproducibility.html), and [**here**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). — Henrik, Oct 28 '14 at 08:50
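
The `cut` suggestion in the first comment could look roughly like the sketch below. The break points, the labels, and the choice that each range includes its upper bound (so 200.5 lands in B) are assumptions made for illustration; the question does not specify them.

```r
set.seed(1)

# Assumed break points: [0, 200] is A, (200, 650] is B, (650, 1000] is C
breaks <- c(0, 200, 650, 1000)
labels <- c("A", "B", "C")

# A few values from the question, plus the 200.5 edge case from the comments
x <- c(100, 125.5, 807.5, 345.2, 200.5)
cats <- cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)
as.character(cats)
# "A" "A" "C" "B" "B"

# The same single call handles a large vector, with no sapply loop
big <- runif(1e6, min = 0, max = 1000)
big_cats <- cut(big, breaks = breaks, labels = labels, include.lowest = TRUE)
```

`cut` returns a factor, so wrap the result in `as.character()` if plain strings are needed. Values outside the outermost breaks come back as `NA`.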

Karolis Koncevičius · Answer 1 · 2014-10-28T08:27:49.700

1

The numbers:

numbers <- sample(1:1000000)

groups:

groups <- rep(letters, each = 40000)  # 26 * 40000 = 1,040,000 labels, already in sorted order

lookup:

categories <- groups[numbers]

EDIT:

If you don't yet have the vector of "groups" you can create it first.

Assume you have a data frame with the range info:

# the first range starts at 1, because R vectors are indexed from 1
ranges <- data.frame(group=c("A","B","C"),
                     start=c(1,300001,600001),
                     end=c(300000,600000,1000000)
                    )

ranges
  group  start   end
1     A      1 3e+05
2     B 300001 6e+05
3     C 600001 1e+06

# if groups are sorted and don't overlap:
groups <- rep(ranges$group, (ranges$end-ranges$start)+1)

Then continue as before:

categories <- groups[numbers]

EDIT: as @jbaums pointed out, you have to add 1 to (ranges$end-ranges$start) here, because a range from start to end contains end - start + 1 integers (already fixed in the example above). Also, the first range has to start at 1 rather than 0, since R vectors are indexed from 1.
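
To see why the +1 matters, here is a tiny worked example. The two ranges and the `numbers` vector are made up for illustration.

```r
# Hypothetical small table: A covers 1..3, B covers 4..5
ranges <- data.frame(group = c("A", "B"),
                     start = c(1, 4),
                     end   = c(3, 5),
                     stringsAsFactors = FALSE)

# Each range holds end - start + 1 integers (1..3 is three values, not two)
groups <- rep(ranges$group, ranges$end - ranges$start + 1)
groups
# "A" "A" "A" "B" "B"

# Integer values index straight into the expanded vector
numbers <- c(5, 1, 3, 4)
groups[numbers]
# "B" "A" "A" "B"
```

Without the +1, `groups` would have only three elements and the lookup for 4 and 5 would return `NA`. A value of 0 would silently select nothing, which is why the first range has to start at 1.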

edited Oct 28 '14 at 08:27

answered Oct 28 '14 at 07:43

Karolis Koncevičius


However, your solution assumes that the `numbers` are integers, whereas it's possible that `numbers` is an unsorted vector of numbers that the OP needs to classify into the groups based on the ranges in which they fall. I guess it's up to the OP to provide more detail about what they have and what they expect. – jbaums Oct 28 '14 at 08:12

:) You are right again. In that case something like `groups[ceiling(numbers)]` should work. If the starting/ending points themselves can be non-integers then I will have to think about another solution. But I will wait for some kind of response from OP first. – Karolis Koncevičius Oct 28 '14 at 08:17
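
For the case where the boundaries themselves are non-integers, one option that the thread does not mention is base R's `findInterval`, which does a binary search against sorted break points. A minimal sketch, assuming the ranges are contiguous so that each category runs from its own lower bound up to the next one (the `starts` and `labels` vectors are hypothetical):

```r
# Assumed lower bound of each category, sorted in ascending order
starts <- c(0, 200.5, 650.25)
labels <- c("A", "B", "C")

numbers <- c(100, 125.5, 807.5, 345.2, 200.5)

# For each number, findInterval gives the index of the last start <= number
idx <- findInterval(numbers, starts)
categories <- labels[idx]
categories
# "A" "A" "C" "B" "B"
```

Numbers below the first lower bound get index 0, which R drops when subsetting, so such values should be filtered out or handled before the `labels[idx]` step.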
