2

Below is my data frame,

    Row_ID  A      B  
       1   0       0  
       2   0       0  
       3   0       0  
       4   0       1  
       5   0       1  
       6   0       1  
       7  62.75    0  
       8  100      0  
       9  100      0 
      10  100     -1 
      11  100     -1  
      12  100     -1  
      13  100     -1   
      14  87.625  -1  
      15   0       0  
      16   0       0  
      17   0       1   
      18   0       1  
      19   10      1    
      20   13.43   1
      21   67.31   0  
      22   86.5    0
      23   99      0  
      24   99      0  
      25   99      0 

I need to calculate average of A every 5 consecutive rows with the below conditions.
And the values 1 and -1 in B are the two transition states.

a The average of A is based on the values in B.

b The data points from Ato be considered for the average calculation starts from the last data point where the transition in B ends. (Example: Transition in B ends at Row_ID = 6, Hence row number 7,8,9,10 and 11 of A has to be considered for calculating average).

c Referring to the above example, Though there are change in values at row number 10 and 11 of B, corresponding rows in Ait should be considered for calculation of average because those values were considered in average calculation in the previous iteration.

Is there any workaround code without using embed() function because my data has around 900K values and since embed() creates matrix, memory usage wont be that optimal.

Expected Output

    Row_ID  A      B   Avg  
       1   0       0     0  
       2   0       0     0  
       3   0       0     0  
       4   0       1     0  
       5   0       1     0  
       6   0       1     0  
       7  62.75    0     92.55  
       8  100      0     92.55  
       9  100      0     92.55 
      10  100     -1     92.55 
      11  100     -1     92.55  
      12  100     -1     0  
      13  100     -1     0   
      14  87.625  -1     0  
      15   0       0     2  
      16   0       0     2  
      17   0       1     2   
      18   0       1     2  
      19   10      1     2    
      20   13.43   1     0
      21   67.31   0     90.16  
      22   86.5    0     90.16
      23   99      0     90.16  
      24   99      0     90.16  
      25   99      0     90.16
ANmike
  • 267
  • 3
  • 13
  • Why is the average of rows 1-4 equal to 0 instead of 4.75? – thelatemail Jul 20 '17 at 06:07
  • @thelatemail I assume that is is based on the 5 consecutive elements as the first case, it is only 4 elements, then 5, then again 4,.. – akrun Jul 20 '17 at 06:09
  • @thelatemail as mentioned in the question the data points to be considered for the average calculation should start where the transition value in `B` ends, value `0` in `B` is not considered as a state, Hence in `B` transition ends at `Row_ID 4` so avg calculation starts from `5` to next 5 consecutive rows. – ANmike Jul 20 '17 at 06:12
  • Updated the post based on the new example – akrun Jul 20 '17 at 07:50

1 Answers1

2

We can use data.table

library(data.table)
setDT(df1)[, Avg := mean(A)*(.N>4), cumsum(c(TRUE,  diff(abs(B)!=1)==1))]
df1
#    Row_ID  A  B  Avg
# 1:      1  0  0  0.0
# 2:      2  5  1  0.0
# 3:      3  6  1  0.0
# 4:      4  8  1  0.0
# 5:      5  9  0 10.2
# 6:      6  8  0 10.2
# 7:      7 15 -1 10.2
# 8:      8 17 -1 10.2
# 9:      9  2 -1 10.2
#10:     10  6  0  0.0
#11:     11  9  0  0.0
#12:     12  8 -1  0.0
#13:     13  5 -1  0.0
#14:     14  2  0  6.4
#15:     15  9  0  6.4
#16:     16  2  1  6.4
#17:     17  9  1  6.4
#18:     18 10  1  6.4

Or an approach by @thelatemail

setDT(df1)[,  Avg := mean(A) *(.N > 4) , (rleid(B) + 1) %/% 2]

Update

Based on the new example, perhaps this helps

setDT(df2)[, Avg := c(rep(mean(head(A, 5)), 5), rep(0, .N-5)), 
        cumsum(c(TRUE,  diff(abs(B)!=1)==1))]
df2
#    Row_ID       A  B    Avg
# 1:      1   0.000  0  0.000
# 2:      2   0.000  0  0.000
# 3:      3   0.000  0  0.000
# 4:      4   0.000  1  0.000
# 5:      5   0.000  1  0.000
# 6:      6   0.000  1  0.000
# 7:      7  62.750  0 92.550
# 8:      8 100.000  0 92.550
# 9:      9 100.000  0 92.550
#10:     10 100.000 -1 92.550
#11:     11 100.000 -1 92.550
#12:     12 100.000 -1  0.000
#13:     13 100.000 -1  0.000
#14:     14  87.625 -1  0.000
#15:     15   0.000  0  2.000
#16:     16   0.000  0  2.000
#17:     17   0.000  1  2.000
#18:     18   0.000  1  2.000
#19:     19  10.000  1  2.000
#20:     20  13.430  1  0.000
#21:     21  67.310  0 90.162
#22:     22  86.500  0 90.162
#23:     23  99.000  0 90.162
#24:     24  99.000  0 90.162
#25:     25  99.000  0 90.162
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    A variation on a theme - `dat[, avg := if(.N>=5) mean(A), by=(rleid(B)+1) %/% 2]` – thelatemail Jul 20 '17 at 06:11
  • @akrun and @thelatemail, Thanks for the solution.. Its working perfectly fine for the data frame that I have given in the question. But its not the case with my actual dataset. The 5 consecutive rows for the calculation of the average is not being considered. I mean its calculating average for all the values until next transition change. Scaling it down to above data frame ,its printing `6` in `Average Column` from `Row_ID 6 ~ 1` . – ANmike Jul 20 '17 at 06:37
  • @ANmike I think then your example is not a good reproducible one that mimics the original dataset – akrun Jul 20 '17 at 06:40
  • @akrun Yup, you are right, I ll just update my question with my original dataset. – ANmike Jul 20 '17 at 06:50
  • @akrun I my very sorry the inconvenience caused, Please take a look at my update data frame and Kindly help. – ANmike Jul 20 '17 at 07:19
  • 1
    @akrun Thanks a Lot, Working exactly as required. – ANmike Jul 20 '17 at 07:59
  • @akrun Could you please clarify me on this.. Assuming that my first 5 values of the `column A` is not `0`. Then Average is being calculated though there is neither `1` or `-1` in `column B` for first 5 rows. How to avoid this? – ANmike Jul 20 '17 at 08:58
  • @akrun I am finding a new error which says `Error in rep(0, .N - 5) : invalid 'times' argument` , what could be the reason? Or Could you please suggest any other work arounds? – ANmike Aug 11 '17 at 07:52
  • @ANmike It was based on the assumption that you have at least 5 or more rows. If it is not the case, then you may have to add an `if/else` condition – akrun Aug 11 '17 at 07:55
  • @akrun Oh!! currently my data set has around 65536 row, but still the error retains. – ANmike Aug 11 '17 at 07:57
  • @ANmike So it could be a case where the length is not a multiple of that. Could you have new question with a small reproducible example that shows the error – akrun Aug 11 '17 at 08:04
  • @sure I ll do it. – ANmike Aug 11 '17 at 08:07
  • @akrun I have Posted another question which reproduces the error https://stackoverflow.com/questions/45635464/calculating-average-based-on-condition-in-r. Kindly Help. – ANmike Aug 11 '17 at 12:42