Most frequent values in sliding window dataframe in R

Question

I have the following dataframe (df):

    A                    B           T    Required col (window = 3)
1   1                    0           1     
2   3                    0           3   
3   4                    0           4        
4   2                    1           1        4
5   6                    0           0        2
6   4                    1           1        0
7   7                    1           1        1
8   8                    1           1        1
9   1                    0           0        1

I would like to add the required column, as followed: Insert in the current row the previous row value of A or B. If in the last 3 (window) rows most of time the content of A column is equal to T column - choose A, otherwise - B. (There can be more columns - so the content of the column with the most times equal to T will be chosen). What is the most efficient way to do it for big data table.

Interesting problem. Have you tried anything? Package [`zoo`](https://cran.r-project.org/web/packages/zoo/index.html) has several `roll*` functions that would help with this, likely starting with "roll count for A, then roll count for B, then compare the two". Otherwise, please provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) (consumable sample data). — r2evans, May 25 '18 at 14:16

IceCreamToucan · Accepted Answer · 2018-05-25T15:32:50.860

I changed the column named T to be named TC to avoid confusion with T as an abbreviation for TRUE

library(tidyverse)
library(data.table)

df[, newcol := {
  equal <- A == TC
  map(1:.N, ~ if(.x <= 3) NA
              else if(sum(equal[.x - 1:3]) > 3/2) A[.x - 1] 
              else B[.x - 1])
}]
df

#    N A B TC newcol
# 1: 1 1 0  1     NA
# 2: 2 3 0  3     NA
# 3: 3 4 0  4     NA
# 4: 4 2 1  1      4
# 5: 5 6 0  0      2
# 6: 6 4 1  1      0
# 7: 7 7 1  1      1
# 8: 8 8 1  1      1
# 9: 9 1 0  0      1

This works too, but it's less clear, and likely less efficient

df[, newcol := shift(A == TC, 1:3) %>% 
                  pmap_lgl(~sum(...) > 3/2) %>% 
                  ifelse(shift(A), shift(B))]

data:

df <- fread("
N    A                    B           TC   
1   1                    0           1     
2   3                    0           3   
3   4                    0           4        
4   2                    1           1        
5   6                    0           0        
6   4                    1           1        
7   7                    1           1        
8   8                    1           1        
9   1                    0           0        
")

score 1 · Answer 2 · answered May 25 '18 at 14:44

1

Probably much less efficient than the answer by Ryan, but without additional packages.

A<-c(1,3,4,2,6,4,7,8,1)
B<-c(0,0,0,1,0,1,1,1,0)
TC<-c(1,3,4,1,0,1,1,1,0)
req<-rep(NA,9)
df<-data.frame(A,B,TC,req)
window<-3
for(i in window:(length(req)-1)){
  equal <- sum(df$A[(i-window+1):i]==df$TC[(i-window+1):i])
  if(equal > window/2){
    df$req[i+1]<-df$A[i]  
  }else{
    df$req[i+1]<-df$B[i]
  }
}

answered May 25 '18 at 14:44

otwtm

1,779
1
16
27

`data.table` is quicker with more complicated tasks such as grouping operations, but I don't think it offers any speed/memory benefit this case. – IceCreamToucan May 25 '18 at 14:50

Most frequent values in sliding window dataframe in R

2 Answers2