I asked a question before and received a good answer, but I need to apply it to a more specific problem. The data table needs to be divided into 16 sectors based on the X and Y values, which are the coordinates I loop over to split it. I have successfully divided the data table into 16 'sectors', and I need to apply the sCalc function to each sector and output a number. I'm looking for a faster way to do this.

Refer to this link for clarification if needed: Faster way to subset data table instead of a for loop R.

library(data.table)

# 1600 x 2000 coordinate grid with two columns of random values
# (no set.seed() is called, so the numbers shown below vary between runs)
DT <- data.table(X = rep(1:2000, times = 1600),
                 Y = rep(1:1600, each = 2000),
                 Norm = rnorm(1600 * 2000),
                 Unif = runif(1600 * 2000))

sCalc <- function(DT) {
    setkey(DT, Norm)                            # sort rows by Norm, ascending
    cells <- DT[1:(nrow(DT) * 0.02)]            # keep the lowest 2% of rows by Norm
    nCells <- nrow(cells)                       # number of sampled rows (not nrow(DT); see the comments below)
    sumCell <- sum(cells[, Norm / sqrt(Unif)])
    return(sumCell / nCells)
}
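
In other words, sCalc is the mean of Norm/sqrt(Unif) over the lowest 2% of rows by Norm. A minimal equivalent sketch (the name sCalcMean is mine, not part of the assignment), which sorts a copy instead of re-keying the input:

sCalcMean <- function(DT) {
    d <- DT[order(Norm)]                        # sorted copy; leaves DT's key alone
    d[1:(nrow(d) * 0.02), mean(Norm / sqrt(Unif))]
}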

startstop <- function(width, y = FALSE) {
    # Split one axis (of length `width`) into 4 bands, then recycle the
    # band coordinates across the 16 sectors.
    startend <- width - (width/4 - 1)
    start <- round(seq(0, startend, length.out = 4))
    stop <- round(seq(width/4, width, length.out = 4))
    # If a start coordinate collides with a stop coordinate, pull that stop back by 1
    if (length(c(start, stop)[anyDuplicated(c(start, stop))]) != 0) {
        dup <- anyDuplicated(c(start, stop))
        stop[which(stop == c(start, stop)[dup])] <- stop[which(stop == c(start, stop)[dup])] - 1
    }
    # Y bands vary slowest (each), X bands vary fastest (times)
    if (y == TRUE) {
        coord <- list(rep(start, each = 4), rep(stop, each = 4))
    } else {
        coord <- list(rep(start, times = 4), rep(stop, times = 4))
    }
    return(coord)
}
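
For example, with the width used below, startstop(2000) gives start coordinates 0, 500, 1001, 1501 and stop coordinates 499, 1000, 1500, 2000 (the duplicated 500 is pulled back to 499), recycled across the 16 sectors. These values follow from the arithmetic above rather than from extra output in the original post:

x <- startstop(2000)
x[[1]][1:4]   # 0 500 1001 1501     start X-coordinates of the four bands
x[[2]][1:4]   # 499 1000 1500 2000  stop X-coordinates of the four bands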

sectorCalc <- function(x, y, DT) {
    sector <- numeric(length = 16)
    for (i in 1:length(sector)) {
        # rows whose coordinates fall inside sector i (%between% is inclusive)
        sect <- DT[X %between% c(x[[1]][i], x[[2]][i]) & Y %between% c(y[[1]][i], y[[2]][i])]
        sector[i] <- sCalc(sect)
    }
    return(sector)
}

x <- startstop(2000)
y <- startstop(1600, y = TRUE)

sectorLoop <- sectorCalc(x, y, DT)

sectorLoop returns:

-4.729271 -4.769156 -4.974996 -4.931120 -4.777013 -4.644919 -4.958968 -4.663221 -4.771545 -4.909868 -4.821098 -4.795526 -4.846709 -4.931514 -4.875148 -4.847105

One suggested solution used the cut function:

DT[, x.sect := cut(X, seq(0, 2000, by = 500), dig.lab = 10)]
DT[, y.sect := cut(Y, seq(0, 1600, by = 400), dig.lab = 10)]
sectorRef <- DT[order(Norm),
                .(sCalc = sum(Norm[1:(0.02 * .N)] / sqrt(Unif[1:(0.02 * .N)])) / (0.02 * .N)),
                by = .(x.sect, y.sect)]
sectorRef <- sectorRef[[3]]
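
As a sanity check (my addition, not part of the original post), each of the 16 cut-defined sectors should contain 500 x 400 = 200,000 rows:

DT[, .N, by = .(x.sect, y.sect)]   # 16 groups, each with N = 200000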

The above solution returns the values:

-4.919447 -4.778576 -4.757455 -4.779086 -4.739814 -4.836497 -4.776635 -4.656748 -4.939441 -4.707901 -4.751791 -4.864481 -4.839134 -4.973294 -4.663360 -5.055344

cor(sectorRef, sectorLoop)

The above returns: 0.0726904
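
Part of the reason the correlation is so low is that the grouped result comes back in group-appearance order, not in the loop's sector order (X varies fastest, then Y), so the two vectors are not aligned element-for-element. A sketch (my suggestion, not from the original post) that sorts the groups into the loop's order before comparing:

sectorRefDT <- DT[order(Norm),
                  .(sCalc = sum(Norm[1:(0.02 * .N)] / sqrt(Unif[1:(0.02 * .N)])) / (0.02 * .N)),
                  by = .(x.sect, y.sect)]
cor(sectorRefDT[order(y.sect, x.sect), sCalc], sectorLoop)

The match still won't be exact, because the sector boundaries differ slightly between the two methods (cut puts X = 500 in the first band, while the loop puts it in the second), but the correlation should be much higher once the sectors are aligned.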

  • I'm trying to understand this question... what is wrong with the solution you posted? (Please don't expect the answerer to chase down info linked in other questions) – C8H10N4O2 Feb 26 '16 at 04:21
  • I need a way to apply the provided sCalc function in the third line instead of "sect = mean(Norm)/min(Unif)^2". So the sCalc function will be applied 16 times for each 'sect' instead of the current code. Either that or a way to apply the operations in the sCalc function the way they are done in the third line of the solution. @C8H10N4O2 – abbas786 Feb 26 '16 at 05:00
  • you realize that sCalc is only using the first two percent of the rows, right? What's the rationale for that? – C8H10N4O2 Feb 26 '16 at 05:24
  • Yes, it's a provided function for a biostats assignment. The point is to divide the data so it's split into 16 smaller 500x400 arrays. Then we're given a calculation to apply over the arrays. – abbas786 Feb 26 '16 at 05:32
  • You did not answer the question. What is the reason for the line `cells <- DT[1:(nrow(DT)*0.02)]` – C8H10N4O2 Feb 26 '16 at 05:38

1 Answer


As far as I can understand the question, the first thing I would explain is that you can use .N to tell you how many rows there are in each by = .(...) group. I think that is analogous to the nrow(DT) inside your sCalc.

And where your cells takes the top 2% of rows in each group, that can be accomplished at the vector level by indexing [1:(0.02*.N)]. Assuming you want the top 2% in order of increasing Norm (which is the order you would get from setkey(DT, Norm), although setting a key does more than just sorting), you could call setkey(DT, Norm) before the calculation, as in the example, or, to make it clearer what you are doing, you could use order(Norm) inside the calculation.

The sum() part doesn't change, so the equivalent third line is:

DT[order(Norm), 
   .(sCalc = sum( Norm[1:(0.02*.N)] / sqrt(Unif[1:(0.02*.N)]) )/.N), 
   by = .(x.sect, y.sect)]

This returns the calculation for each of the 16 groups:

         x.sect      y.sect       sCalc
 1: (1500,2000]  (800,1200] -0.09380209
 2:  (499,1000]   (399,800] -0.09833151
 3:  (499,1000] (1200,1600] -0.09606350
 4:     (0,499]   (399,800] -0.09623751
 5:     (0,499]  (800,1200] -0.09598717
 6: (1500,2000]     (0,399] -0.09306580
 7: (1000,1500]   (399,800] -0.09669593
 8: (1500,2000]   (399,800] -0.09606388
 9: (1500,2000] (1200,1600] -0.09368166
10:  (499,1000]     (0,399] -0.09611643
11: (1000,1500]     (0,399] -0.09404482
12:     (0,499] (1200,1600] -0.09387951
13: (1000,1500] (1200,1600] -0.10069461
14: (1000,1500]  (800,1200] -0.09825285
15:     (0,499]     (0,399] -0.09890184
16:  (499,1000]  (800,1200] -0.09756506
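
The divisor here is .N, the full group size, which matches the nCells = nrow(DT) of the question as first posted; the comments below work out that dividing by the number of sampled rows, 0.02 * .N, reproduces the corrected sCalc. A spot-check sketch (mine, not from the original answer), using the cut definitions from the question (the interval labels in the table above came from a run with slightly different breaks):

# sCalc on the first sector's rows; with the 0.02 * .N divisor, the grouped
# value for x.sect == "(0,500]" and y.sect == "(0,400]" should match this
sCalc(DT[X %between% c(1, 500) & Y %between% c(1, 400)])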
  • Actually I think this is almost what I need. Is it possible to sort each of the 16 groups by a column like "Norm" before doing the calculations or even taking the 2%? – abbas786 Feb 26 '16 at 05:56
  • @abbas786 OK, that makes a little more sense, I have updated the answer as well as your title to make it clearer what you are asking. For help with a particular package (like data.table) it is always a good idea to use the package tag. There are experts on this package who probably didn't see your question because it wasn't tagged. – C8H10N4O2 Feb 26 '16 at 14:11
  • So I compared my older, slower method with this method on the same `DT` and I expected results to be close enough. I used `cor()` to compare the two values and it returned '0.3880811'. I imagine the way the array is divided is slightly different for the two methods but the correlation shouldn't be so low right? – abbas786 Feb 27 '16 at 21:39
  • The `order()` is applied per group before the calculations right? – abbas786 Feb 27 '16 at 21:50
  • Actually based on the function shouldn't it be `sCalc = sum( Norm[1:(0.02*.N)] / sqrt(Unif[1:(0.02*.N)]) )/0.02*.N`? – abbas786 Feb 27 '16 at 21:57
  • So I just realized I made a mistake, in the function `nCells` should be `nrow(cells)` not `nrow(DT)`. I just caught that, so based on that change shouldn't the above be correct? But I ran it and the numbers didn't match at all to the slower method. @C8H10N4O2 – abbas786 Feb 27 '16 at 22:00
  • @abbas786 yes, based on that change the above should be correct. When you say "I ran it but the numbers didn't match", you need to edit your question to show the corrected function, the way you ran it, and the desired output. Please see [how to make a great R reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – C8H10N4O2 Feb 28 '16 at 01:19
  • and you should `set.seed()` whenever you generate random numbers, so you **can** reproduce the same answer each time – SymbolixAU Feb 28 '16 at 06:55
  • I ended up taking the mean of the 16 values, and it all averages out to be the same. So that works. Thanks – abbas786 Feb 28 '16 at 07:07