
I am simulating data in R to check which models perform better when outliers and multicollinearity are present simultaneously. I split the data with a 70:30 random split, but I need to introduce the outliers and multicollinearity only in the 70% training sample and keep the 30% test sample clean. How can I do that in R?

The following is my R code, in which the outliers and multicollinearity are introduced into the whole data set.

      library(MASS)   # needed for rlm() and psi.huber

      um <- function(R,n,sig,p,po,py,fx,fy){
    
      #' "R" is the level of multicollinearity, between 0 and 1
      #' "n" is the sample size
      #' "sig" is the error standard deviation (sd argument of rnorm)
      #' "p" is the number of explanatory variables
      #' "po" is the proportion of outliers in the x direction
      #' "py" is the proportion of outliers in the y direction
      #' "fx" is the magnitude of outliers in the x direction
      #' "fy" is the magnitude of outliers in the y direction
     
      RR=1000
      set.seed(123)
      OP1=NULL
    
      # Explanatory variables
      
      x=matrix(0,nrow=n,ncol=p)
      W <-matrix(rnorm(n*(p+1),mean=0,sd=1), n, p+1)  
      for (i in 1:n){
        for (j in 1:p){
          x[i,j] <- sqrt(1-R^2)*W[i,j]+(R)*W[i,p+1];      #Introducing multicollinearity
        }    
      }
      
      b=eigen(t(x)%*%x)$vectors[,1]        # true coefficients: leading eigenvector of X'X
      
      # Introducing outliers in the x direction
      rep1=sample(1:n, size=po*n, replace=FALSE)
      x[rep1,2]=fx*max(x[,2])+x[rep1,2]    # shift the selected rows of column 2
      for (i in 1:RR){
        u=rnorm(n,0,sig)
        y=x%*%b+u
        rep2=sample(1:n, size=py*n, replace=FALSE)
        y[rep2]=fy*max(y)+y[rep2]          # outliers in the y direction
        
        dat=data.frame(y,x)
        dat[] <- lapply(dat, scale)        # standardise every column
        
        # 70:30 random split
        training_idx = sample(1:nrow(dat),nrow(dat)*0.7,replace=FALSE)
        tes_idx = setdiff(1:nrow(dat),training_idx)
        training = dat[training_idx,]
        xtr=as.matrix(training[,-1])
        ytr=training[,1]
        test = dat[tes_idx,]
        xte=as.matrix(test[,-1])
        yte=test[,1]
        
        # building the models on training data
        mest=rlm(ytr~xtr,psi=psi.huber,k2=1.345,maxit=1000)$coefficients
        ols=lm(ytr~xtr)$coefficients
        
        # Calculate MdAE on test data
        OLS=median(abs(yte-cbind(1,xte)%*%ols))
        M=median(abs(yte-cbind(1,xte)%*%mest))
    
        res2=cbind(OLS,M)
    
        OP1=rbind(OP1,res2)                # accumulate results over the RR replications
      }
      
      MAE=colMeans(OP1)                    # average MdAE over replications
      
      data.frame(R,n,sig,p,po,py,fx,fy,OLS=MAE["OLS"],M=MAE["M"])
      }
       results=NULL
       R=c(0.99)
       n=c(100)
       sig=c(5)
       p=c(5)
       po=c(0.2)
       py=c(0.2)
       fx=c(5)
       fy=c(5)
    
    for(i in 1:length(R)){
      for(j in 1:length(n)){
        for(k in 1:length(sig)){
          for(l in 1:length(p)){
            for(m in 1:length(po)){
              for(nn in 1:length(py)){
                for(o in 1:length(fx)){
                  for(pp in 1:length(fy)){
                    results=rbind(results,um(R=R[i],n=n[j],sig=sig[k],p=p[l],
                                               po=po[m],py=py[nn],fx=fx[o],fy=fy[pp]))
                  }
                }
              }
            }
          }
        }
      }
    }
    
    View(results)
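
To clarify what I am after, here is a minimal, untested sketch (single replication, no standardisation, same parameter values as above) of what "contaminating only the training rows" would look like: the clean data are split first, and the x- and y-direction shifts are applied only to the training part, so the test rows stay outlier-free. My difficulty is folding this into the replication loop of `um()` above. (Keeping the test set free of multicollinearity as well would presumably require simulating its design matrix separately.)

    library(MASS)   # rlm() and psi.huber
    
    set.seed(123)
    n <- 100; p <- 5; R <- 0.99; sig <- 5
    po <- 0.2; py <- 0.2; fx <- 5; fy <- 5
    
    # Collinear design and response, as in um() above
    W <- matrix(rnorm(n * (p + 1)), n, p + 1)
    x <- sqrt(1 - R^2) * W[, 1:p] + R * W[, p + 1]
    b <- eigen(t(x) %*% x)$vectors[, 1]
    y <- as.vector(x %*% b + rnorm(n, 0, sig))
    
    # 70:30 split on the *clean* data first
    train_idx <- sample(seq_len(n), size = floor(0.7 * n))
    xtr <- x[train_idx, ];  ytr <- y[train_idx]
    xte <- x[-train_idx, ]; yte <- y[-train_idx]
    
    # Outliers only in the training rows
    rep1 <- sample(seq_along(ytr), size = floor(po * length(ytr)))
    xtr[rep1, 2] <- fx * max(xtr[, 2]) + xtr[rep1, 2]   # x-direction shift
    rep2 <- sample(seq_along(ytr), size = floor(py * length(ytr)))
    ytr[rep2] <- fy * max(ytr) + ytr[rep2]              # y-direction shift
    
    # Fit on the contaminated training set, evaluate on the clean test set
    mest <- rlm(ytr ~ xtr, psi = psi.huber, k2 = 1.345, maxit = 1000)$coefficients
    ols  <- lm(ytr ~ xtr)$coefficients
    MdAE <- c(OLS = median(abs(yte - cbind(1, xte) %*% ols)),
              M   = median(abs(yte - cbind(1, xte) %*% mest)))
    MdAE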
  • What is the statistical justification for this? – Mike Jul 12 '21 at 15:01
  • @Mike, I should examine the performance measures on a separate data set so that I do not measure them on a possibly overfit model. – jeza Jul 12 '21 at 15:13
  • Why not run your function a second time, remove/change the parameters for multicollinearity, and use that as a test sample? – tester Jul 14 '21 at 21:05
  • @tester, thanks, but I cannot, because it all needs to happen within a single loop. – jeza Jul 15 '21 at 13:32
  • I find the suggested approach quite dubious. If outliers and collinearity exist only in your training data and not in your validation data, the two datasets must have been collected fully independently. Thus, you need to do two simulations: one for the training data and one for the validation data. No random splitting of a dataset. However, in reality, if these are so fundamentally different, you have a serious issue with your data acquisition. – Roland Jul 16 '21 at 08:31
  • @Roland, because the model will be built on the training data, where outliers and collinearity are introduced, so I can see how these models are dealing with outliers and collinearity. The performance measures will then be examined on the validation data. – jeza Jul 16 '21 at 10:44
  • @jeza I had already understood what you are saying. Apparently, you don't understand my comment. – Roland Jul 16 '21 at 10:49
  • @Roland, could you please clarify that a bit more? – jeza Jul 16 '21 at 11:13
  • You intend to simulate a training and a validation dataset with fundamentally different data-generating processes. If these datasets are known to have different properties, you can't use the validation dataset to test a model trained on the training dataset. Your proposed simulation is only sensible if looking at the impact of this issue is your goal. Since you are simulating data (and thus **know** the true values of all parameters), you don't need testing data to "see how these models are dealing with outliers and collinearity". – Roland Jul 16 '21 at 11:34

0 Answers