
I have a data.table in R (you'll need the data.table package installed) generated with X and Y coordinates and random data values from both normal and uniform distributions. The coordinates represent points on a 2000x1600 array that has to be divided into 16 smaller "sectors", each 500x400. For each sector, the mean of the normal-distribution values has to be divided by the square of the minimum of the uniform-distribution values. I also created two variables, x and y, using a provided function startstop; they hold the coordinates for the 16 sectors. A provided function, sCalc, calculates the number for each sector.

library(data.table)
DT <- data.table(X = rep(1:2000, times = 1600),
                 Y = rep(1:1600, each = 2000),
                 Norm = rnorm(1600 * 2000),
                 Unif = runif(1600 * 2000))

## sCalc was provided with the assignment; this definition is reconstructed
## from the description above: mean of Norm divided by the squared minimum of Unif
sCalc <- function(sect) {
    sect[, mean(Norm) / min(Unif)^2]
}

sectorCalc <- function(x, y, DT) {
    sector <- numeric(length = 16)
    for (i in 1:length(sector)) {
        sect <- DT[X %between% c(x[[1]][i], x[[2]][i]) & Y %between% c(y[[1]][i], y[[2]][i])]
        sector[i] <- sCalc(sect)
    }
    return(sector)
}

startstop <- function(width, y = FALSE) {
    startend <- width - (width/4 - 1)
    start <- round(seq(0, startend, length.out = 4))
    stop <- round(seq(width/4, width, length.out = 4))
    if (length(c(start, stop)[anyDuplicated(c(start, stop))]) != 0) {
        dup <- anyDuplicated(c(start, stop))
        stop[which(stop == c(start, stop)[dup])] <- stop[which(stop == c(start, stop)[dup])] - 1
    }
    if (y == TRUE) {
        coord <- list(rep(start, each = 4), rep(stop, each = 4))
    } else if (y == FALSE) {
        coord <- list(rep(start, times = 4), rep(stop, times = 4))
    }
    return(coord)
}

x <- startstop(2000)
y <- startstop(1600, T)

sectorNos <- sectorCalc(x,y,DT)
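
For reference, a quick sketch of the sector boundaries startstop produces (the values follow from the rounding inside the function):

unique(cbind(start = x[[1]], stop = x[[2]]))
#      start stop
# [1,]     0  499
# [2,]   500 1000
# [3,]  1001 1500
# [4,]  1501 2000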

The startstop function isn't really an issue, but I need a faster way to subset the data.table. Some modifications have to be made to the sectorCalc function. The for loop was the best way I could think of, but I don't have much experience with data.table. Any ideas on a faster method of breaking up the data table?

abbas786
  • The question was very well described (incl. sample data and code). Just a (really) little wish to improve the question: you should include `set.seed` when creating sample data with random numbers so the results can be reproduced across different answers. – R Yoda Feb 25 '16 at 08:25
  • I haven't used that function before. Thanks for the tip. @RYoda – abbas786 Feb 25 '16 at 15:26
  • Please, please avoid changing the question significantly later, since it invalidates previous answers. Can you please explain what your function does (intention)? It looks like it is calculating something similar to a variance or stddev for the 2 percent of the lowest values of Norm (why 2%?). Thanks :-) – R Yoda Feb 25 '16 at 22:39
  • The function is just a sample calculation for an assignment; it was provided. The purpose is to successfully subset the data table and apply the function over each subset. I can do this successfully, I'm just looking for a faster way to do it. @RYoda – abbas786 Feb 25 '16 at 23:33
  • As per @RYoda's comment, I would suggest reverting this question to your original one, then accepting one of the answers (if they work for you). Then, take your new requirement and ask a new question (but link back to this one). – SymbolixAU Feb 25 '16 at 23:36

2 Answers


To replace your sectorCalc function I think we can make use of data.table's joins.

As you are looping over each row of sector, you just have to create a data.table of your sector boundaries to join onto, specify a column to join on (here I'm using key_col), and specify a 'group' variable for each row, to enable us to do the calculation at the end:

x <- startstop(2000)
y <- startstop(1600, T)
## copy the original DT
dt <- copy(DT)


dt_xy <- data.table(x_1 = x[[1]],
                    x_2 = x[[2]],
                    y_1 = y[[1]],
                    y_2 = y[[2]])


dt[, key_col := 1]
dt_xy[, `:=`(key_col = 1, xy_grp = seq(1,.N))]

## Use a data.table join, allowing cartesian, then filter to the rows inside each sector.
dt_res <- dt[ dt_xy, on="key_col", allow.cartesian=T][x_1 <= X & X <= x_2 & y_1 <= Y & Y <= y_2]

## calculate 'sect' as required.
dt_sect <- dt_res[, .(sect = mean(Norm)/min(Unif)^2), by = .(xy_grp)]
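
As a quick sanity check (a sketch, assuming the sCalc reconstructed in the question computes mean(Norm)/min(Unif)^2), the grouped result should match the loop:

## compare with the loop-based version; sort both, since the group order
## of the join result need not match the loop order
sector_loop <- sectorCalc(x, y, DT)
all.equal(sort(sector_loop), sort(dt_sect$sect))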
SymbolixAU
  • Is it possible to `setkey` on each 'sect' before doing the calculation with this method? For example, if there were a way to do that when creating `dt_sect`, using either an existing column or a fifth one in `DT`? – abbas786 Feb 25 '16 at 16:11

A solution using not only the package data.table but also the cut function to build the interval "groups":

# Create your test data
library(data.table)

set.seed(123)      # make random numbers reproducible to allow comparison of different answers
DT <- data.table(X = rep(1:2000, times = 1600),
                 Y = rep(1:1600, each = 2000),
                 Norm = rnorm(1600 * 2000),
                 Unif = runif(1600 * 2000))

# calculate the sector by cutting the x and y values into groups defined by the interval breaks.
# The intervals would ideally be seq(0, 2000, by = 500) and seq(0, 1600, by = 400); each lower
# bound is one less since it is not included in the interval (see the help for the cut function).
DT[, x.sect := cut(DT[, X], c(0, 499, 1000, 1500, 2000), dig.lab = 10)]
DT[, y.sect := cut(DT[, Y], c(0, 399, 800, 1200, 1600), dig.lab = 10)]

# Now calculate per group (calculation logic "stolen" from the working answer of user "SymbolixAU")
DT[, .(sect = mean(Norm)/min(Unif)^2), by=.(x.sect, y.sect)]

Please note: I think the size of the first and second interval is wrong in the original solution (499 instead of 500 for x, and 399 instead of 400 for y), so I could not use the seq function to reproduce your desired intervals but had to enumerate the interval breaks manually.
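
If you prefer to build the breaks programmatically anyway, one option (a sketch that reproduces the breaks used above) is to generate them with seq and then lower the first inner boundary:

x.breaks <- seq(0, 2000, by = 500)   # 0, 500, 1000, 1500, 2000
x.breaks[2] <- x.breaks[2] - 1       # 0, 499, 1000, 1500, 2000
y.breaks <- seq(0, 1600, by = 400)
y.breaks[2] <- y.breaks[2] - 1       # 0, 399, 800, 1200, 1600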

Edit 1: I have replaced the original code that adds the x.sect and y.sect columns with an improved solution that adds the columns by reference (:=).

Edit 2: If you want to order the result you have (at least) two options:

# "Chaining" (output is input of next)
DT[, .(sect = mean(Norm)/min(Unif)^2), by=.(x.sect, y.sect)][order(x.sect, y.sect),]
# Or: Use the "keyby" param instead of "by"
DT[, .(sect = mean(Norm)/min(Unif)^2), keyby=.(x.sect, y.sect)]

Edit 3: Added the dig.lab=10 param to the cut function in the code above to avoid scientific notation of the interval breaks.
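
To illustrate the effect (a small sketch; the labels shown are what cut's default dig.lab = 3 produces):

cut(1500, c(0, 499, 1000, 1500, 2000))                # (1e+03,1.5e+03]
cut(1500, c(0, 499, 1000, 1500, 2000), dig.lab = 10)  # (1000,1500]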

R Yoda
  • Do you know of a way to sort or use `setkey` on each `sect` before doing the calculation? – abbas786 Feb 25 '16 at 18:34
  • Oh yes, use the `keyby` param of `data.table` instead of `by` (see help `?data.table`). Since the calculation is a "one-time full-table scan" I don't think that `setkey` will improve the performance, but you can try it. From the help doc on `keyby`: *An ad-hoc-by or keyed-by (just as by= defined above) but with an additional setkey() run on the by columns of the result afterwards, for convenience. It is common practice to use 'keyby=' routinely when you wish the result to be sorted. Out loud we read keyby= as by= then setkey...* – R Yoda Feb 25 '16 at 20:28
  • Sorry, perhaps I misunderstood you: do you want to sort the result in a special order, or improve the calculation speed? – R Yoda Feb 25 '16 at 20:30
  • No, I need to sort by another column for statistical purposes. So I need to group by `x.sect` and `y.sect`, but what if I wanted to sort by `Norm`? – abbas786 Feb 25 '16 at 21:53
  • Could I do `DT[, .(sect = mean(Norm)/min(Unif)^2), by=.(x.sect, y.sect)][order(Norm),]`? – abbas786 Feb 25 '16 at 21:54
  • Hmh, I am confused. If I order by Norm per group the order doesn't matter in the result since it is "aggregated". What is your goal? You can order the result only by columns contained in the result. – R Yoda Feb 25 '16 at 22:00
  • I edited the problem to explain. I need to apply a function to each `sect`:

    sCalc <- function(DT) {
        setkey(DT, Norm)
        cells <- DT[1:(nrow(DT)*0.02)]
        nCells <- nrow(DT)
        sumCell <- sum(cells[, Norm/sqrt(Unif)])
        return(sumCell/nCells)
    }

    – abbas786 Feb 25 '16 at 22:22
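
A possible way to plug such a per-sector function into the grouped call (a sketch, assuming the cut-based x.sect/y.sect columns from the answer above; note that setkey cannot be used on .SD inside a grouped call, so this variant sorts with order instead):

sCalc2 <- function(d) {
    cells <- d[order(Norm)][1:(nrow(d) * 0.02)]   # the 2% lowest Norm values
    sum(cells[, Norm / sqrt(Unif)]) / nrow(d)
}
DT[, .(sect = sCalc2(.SD)), by = .(x.sect, y.sect), .SDcols = c("Norm", "Unif")]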