0

I have a data.table that looks like:

A <- c(1,3,5,20,21,21)
B <- c(1, 2, 3, 4, 5, 6)
C <- c("I","I","II","II","III","III")
D <- c(0.7, 0.3, 0.5, 0.9, 4, 7)
M <- data.table(A,B,C,D) 

My question is similar to R help: divide values by sum produced through factor with a few extra considerations. A specifies a date (I'm simply using integers here). B are individuals. C is a classification in the individual belongs to. D is a value variable.

For each classification c of C, for each day a of A, divide the value D by the sum of the values for all individuals in c, carrying backward when needed such that 0<x-a<=N where x is the date of another individual (meaning that we pick the smallest x-a and use that as an approximation for the value of the other individual in group c on day a).

Let's say N=5. Here's my expected output.

A <- c(1,3,5,20,21,21)
B <- c(1, 2, 3, 4, 5, 6)
C <- c("I","I","II","II","III","III")
D <- c(0.7/(0.7+0.3), 0.3/(0.3), 0.5/(0.5), 0.9/(0.9), 4/(4+7), 7/(4+7))
M <- data.table(A,B,C,D) 

Note that the values for group B are not carried backward for individual 3, as the length is greater than 5 (20-5). Is there a nice way of doing this in data.table?

For each value in D, I wish to divide by the sum of all the values of the same group (either I, II,II) on that day. However, you'll notice for some groups, observations do not exist on that day. I'll try and walk through the logic on a few observations.

Edit: Let me try and walk through a few cases.

For individual 1 (column B) on day 1 (column A), the individual is of group I (column C). Other individuals of group I are: 2. For each of those others, we see that for individual 2, their nearest observation is on day 3 and 3-1<=5, so we'll use 0.3 in the denominator.

For individual 3 (column B) on day 5 (column A), the individual is of group II (column C). Other individuals of group II are: 3. For each of those others, we see that for individual 3, their nearest observation is on day 20 and 20-5>5, so we cannot use their observation in the denominator.

Uwe
  • 41,420
  • 11
  • 90
  • 134
Almacomet
  • 211
  • 2
  • 9
  • Can you explain more what you mean by "pick the smallest x-a and use that as an approximation for the value of the other individual in group c on day a". Why are some of the denominators sums and others the numerator value? – ThatGuy May 25 '17 at 02:22
  • @EddieS Sure, please see my edit. I could do a reshape to a wide format, carry back using something like the zoo package, take the sum, divide, and then reshape back, but I was wondering if there is an efficient data.table way. – Almacomet May 25 '17 at 02:34
  • Do you want to make the denominator a sum of all those values for individuals within a group so long as the date is within N more than the current individual? So, for example, if there was another person in group 'I' with a date of 7 and value of 0.1, 3 would be 0.3/(0.3+0.1) and 7 would be 0.1/(0.1)? – ThatGuy May 25 '17 at 02:41
  • @EddieS Yes, but only going forward (so within positive N of the current individual). So with the new individual, 3 would be 0.3/(0.3+0.1) and and 7 would be 0.1/(0.1) (since those other two individuals are in the past and thus a negative N). – Almacomet May 25 '17 at 02:50

2 Answers2

0

This, I think, will give you your answer:

A <- c(1,3,5,20,21,21, 7)
B <- c(1, 2, 3, 4, 5, 6, 7)
C <- c("I","I","II","II","III","III", "I")
V <- c(0.7, 0.3, 0.5, 0.9, 4, 7, 0.1)

N=5
#Put data into a frame
test = data.frame(A,B,C,V)
#order the data
test = test[order(as.numeric(test$C), test$A),]
#Get the 'rollback' possibilities for each value
Roll = sapply(test$A, FUN = function(x){paste(which(test$A < (x+N) & test$A >= x), collapse=",")})
#Get the groupings
Group = sapply(test$C, FUN = function(x){paste(which(test$C == x), collapse=",")})
#Intersect the values
ToGet = apply(cbind(Roll, Group), MARGIN=1, FUN=function(x){intersect(unlist(strsplit(x[1],",")), unlist(strsplit(x[2],",")))})
#Calculate the denominators
test$D = sapply(ToGet, FUN=function(x){sum(test$V[as.numeric(x)])})
test$Calc = test$V/test$D

Output:

> test
   A B   C   V    D      Calc
1  1 1   I 0.7  1.0 0.7000000
2  3 2   I 0.3  0.4 0.7500000
7  7 7   I 0.1  0.1 1.0000000
3  5 3  II 0.5  0.5 1.0000000
4 20 4  II 0.9  0.9 1.0000000
5 21 5 III 4.0 11.0 0.3636364
6 21 6 III 7.0 11.0 0.6363636
ThatGuy
  • 1,225
  • 10
  • 28
0

The questions is tagged with data.table, so here is a data.table solution which uses non-equi joins to identify individuals within each group to treat them as cohort if the observations fall within a date window of 5 days.

library(data.table)   # CRAN version 1.10.4 used

# set length of date window in days 
N <- 5L
# give columns more semantic names according to OP's description 
setnames(M, c("day", "id", "grp", "val"))

# prepare data for non-equi join: allowable date range
ranged <- M[, .(start = day, end = day + N, co.id = id, grp)]

# non-equi join to determine cohort
joined <- M[ranged, on = c("grp", "day>=start", "day<=end")]

# compute denominator for each cohort
grouped <- joined[, .(den = sum(val)), by = co.id]

# final update on join and order
result <- M[grouped, on = c("id==co.id"), calc := val / den][order(grp, id)]

result
#   day id grp val      calc
#1:   1  1   I 0.7 0.7000000
#2:   3  2   I 0.3 0.7500000
#3:   7  7   I 0.1 1.0000000
#4:   5  3  II 0.5 1.0000000
#5:  20  4  II 0.9 1.0000000
#6:  21  5 III 4.0 0.3636364
#7:  21  6 III 7.0 0.6363636

Data

A <- c(1,3,5,20,21,21, 7)
B <- c(1, 2, 3, 4, 5, 6, 7)
C <- c("I","I","II","II","III","III", "I")
D <- c(0.7, 0.3, 0.5, 0.9, 4, 7, 0.1)
M <- data.table(A,B,C,D)

Compact versions

For those who prefer compact code, here is a more convoluted version:

joined <- M[M[, .(start = day, end = day + N, co.id = id, grp)], 
            on = c("grp", "day>=start", "day<=end")]
M[joined[, .(den = sum(val)), by = co.id], on = c("id==co.id"), 
            calc := val / den][order(grp, id)]

Or, as a "one-liner":

M[M[M[, .(start = day, end = day + N, co.id = id, grp)], 
    on = c("grp", "day>=start", "day<=end")
    ][, .(den = sum(val)), co.id], 
  on = c("id==co.id"), calc := val / den][order(grp, id)]
Uwe
  • 41,420
  • 11
  • 90
  • 134