
I am simulating data in R to check which models perform better when outliers and multicollinearity are present simultaneously. I split the data with a 70:30 random split, but I need to introduce the outliers and multicollinearity only in the 70% training sample and keep the 30% test sample clean. How can I do that in R?

The following is my R code, in which the outliers and multicollinearity are introduced into the whole data set.

      library(MASS)   # needed for rlm() and psi.huber

      um <- function(R,n,sig,p,po,py,fx,fy){
    
      #' "R" is the level of multicollinearity, between 0 and 1
      #' "n" is the sample size
      #' "sig" is the error standard deviation (sd argument of rnorm)
      #' "p" is the number of explanatory variables
      #' "po" is the proportion of outliers in the x direction
      #' "py" is the proportion of outliers in the y direction
      #' "fx" is the magnitude of outliers in the x direction
      #' "fy" is the magnitude of outliers in the y direction
     
      RR=1000
      set.seed(123)
      OP1=NULL
    
      # Explanatory variables
      
      x=matrix(0,nrow=n,ncol=p)
      W <-matrix(rnorm(n*(p+1),mean=0,sd=1), n, p+1)  
      for (i in 1:n){
        for (j in 1:p){
          x[i,j] <- sqrt(1-R^2)*W[i,j]+(R)*W[i,p+1];      #Introducing multicollinearity
        }    
      }
      
      b=eigen(t(x)%*%x)$vectors[,1]        # true coefficients: leading eigenvector of X'X
      
      # Introducing outliers in the x direction
      rep1=sample(1:n, size=po*n, replace=FALSE)
      x[rep1,2]=fx*max(x[,2])+x[rep1,2]    # shift the selected rows of column 2
      for (i in 1:RR){
        u=rnorm(n,0,sig)
        y=x%*%b+u
        rep2=sample(1:n, size=py*n, replace=FALSE)
        y[rep2]=fy*max(y)+y[rep2]          # outliers in the y direction
        
        dat=data.frame(y,x)
        dat[] <- lapply(dat, scale)        # standardise every column
        
        # 70:30 random split
        training_idx = sample(1:nrow(dat),nrow(dat)*0.7,replace=FALSE)
        tes_idx = setdiff(1:nrow(dat),training_idx)
        training = dat[training_idx,]
        xtr=as.matrix(training[,-1])
        ytr=training[,1]
        test = dat[tes_idx,]
        xte=as.matrix(test[,-1])
        yte=test[,1]
        
        # building the models on training data
        mest=rlm(ytr~xtr,psi=psi.huber,k2=1.345,maxit=1000)$coefficients
        ols=lm(ytr~xtr)$coefficients
        
        # Calculate MdAE on test data
        OLS=median(abs(yte-cbind(1,xte)%*%ols))
        M=median(abs(yte-cbind(1,xte)%*%mest))
    
        res2=cbind(OLS,M)
    
        OP1=rbind(OP1,res2)                # accumulate results over the RR replications
      }
      
      MAE=colMeans(OP1)                    # average MdAE over replications
      
      data.frame(R,n,sig,p,po,py,fx,fy,OLS=MAE["OLS"],M=MAE["M"])
      }
       results=NULL
       R=c(0.99)
       n=c(100)
       sig=c(5)
       p=c(5)
       po=c(0.2)
       py=c(0.2)
       fx=c(5)
       fy=c(5)
    
    for(i in 1:length(R)){
      for(j in 1:length(n)){
        for(k in 1:length(sig)){
          for(l in 1:length(p)){
            for(m in 1:length(po)){
              for(nn in 1:length(py)){
                for(o in 1:length(fx)){
                  for(pp in 1:length(fy)){
                    results=rbind(results,um(R=R[i],n=n[j],sig=sig[k],p=p[l],
                                               po=po[m],py=py[nn],fx=fx[o],fy=fy[pp]))
                  }
                }
              }
            }
          }
        }
      }
    }
    
    View(results)
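
To clarify what I am after, here is a minimal, untested sketch (single replication, no standardisation, same parameter values as above) of what "contaminating only the training rows" would look like: the clean data are split first, and the x- and y-direction shifts are applied only to the training part, so the test rows stay outlier-free. My difficulty is folding this into the replication loop of `um()` above. (Keeping the test set free of multicollinearity as well would presumably require simulating its design matrix separately.)

    library(MASS)   # rlm() and psi.huber
    
    set.seed(123)
    n <- 100; p <- 5; R <- 0.99; sig <- 5
    po <- 0.2; py <- 0.2; fx <- 5; fy <- 5
    
    # Collinear design and response, as in um() above
    W <- matrix(rnorm(n * (p + 1)), n, p + 1)
    x <- sqrt(1 - R^2) * W[, 1:p] + R * W[, p + 1]
    b <- eigen(t(x) %*% x)$vectors[, 1]
    y <- as.vector(x %*% b + rnorm(n, 0, sig))
    
    # 70:30 split on the *clean* data first
    train_idx <- sample(seq_len(n), size = floor(0.7 * n))
    xtr <- x[train_idx, ];  ytr <- y[train_idx]
    xte <- x[-train_idx, ]; yte <- y[-train_idx]
    
    # Outliers only in the training rows
    rep1 <- sample(seq_along(ytr), size = floor(po * length(ytr)))
    xtr[rep1, 2] <- fx * max(xtr[, 2]) + xtr[rep1, 2]   # x-direction shift
    rep2 <- sample(seq_along(ytr), size = floor(py * length(ytr)))
    ytr[rep2] <- fy * max(ytr) + ytr[rep2]              # y-direction shift
    
    # Fit on the contaminated training set, evaluate on the clean test set
    mest <- rlm(ytr ~ xtr, psi = psi.huber, k2 = 1.345, maxit = 1000)$coefficients
    ols  <- lm(ytr ~ xtr)$coefficients
    MdAE <- c(OLS = median(abs(yte - cbind(1, xte) %*% ols)),
              M   = median(abs(yte - cbind(1, xte) %*% mest)))
    MdAE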
  • What is the statistical justification for this? – Mike Jul 12 '21 at 15:01
  • @Mike, I should examine the performance measures on a separate data set so that I do not measure them on a possibly overfit model. – jeza Jul 12 '21 at 15:13
  • Why not run your function a second time, remove/change the parameters for multicollinearity, and use that as a test sample? – tester Jul 14 '21 at 21:05
  • @tester, thanks, but I cannot, because it all needs to happen within a single loop. – jeza Jul 15 '21 at 13:32
  • I find the suggested approach quite dubious. If outliers and collinearity exist only in your training data and not in your validation data, the two datasets must have been collected fully independently. Thus, you need to do two simulations: one for the training data and one for the validation data. No random splitting of a dataset. However, in reality, if these are so fundamentally different, you have a serious issue with your data acquisition. – Roland Jul 16 '21 at 08:31
  • @Roland, because the model will be built on the training data, where outliers and collinearity are introduced, so I can see how these models are dealing with outliers and collinearity. The performance measures will then be examined on the validation data. – jeza Jul 16 '21 at 10:44
  • @jeza I had already understood what you are saying. Apparently, you don't understand my comment. – Roland Jul 16 '21 at 10:49
  • @Roland, could you please clarify that a bit more? – jeza Jul 16 '21 at 11:13
  • You intend to simulate a training and a validation dataset with fundamentally different data-generating processes. If these datasets are known to have different properties, you can't use the validation dataset to test a model trained on the training dataset. Your proposed simulation is only sensible if looking at the impact of this issue is your goal. Since you are simulating data (and thus **know** the true values of all parameters), you don't need testing data to "see how these models are dealing with outliers and collinearity". – Roland Jul 16 '21 at 11:34

0 Answers