
I need to simulate 1000 sets of normally distributed data (each with 60 subgroups, n=5) in R. Each set should be contaminated with 4 outliers (more than 1.5 IQR). Can anyone help?

Thanks in advance

  • Please clarify your question: what is the difference between "set" and "subgroup"? What is n? Do you need to create 1000 dataframes with 60 columns and 5 rows, or something else? What mean and standard deviation should be used? Are they the same for all sets? – Katia Jun 23 '19 at 05:49
  • I need to simulate 1000 dataframes with 60 rows and 5 columns. The mean is 1 and the standard deviation is 0. This is the same for all 1000 dataframes. In addition, I would like to contaminate each with 4 outliers (>1.5 IQR). – Koh Siew Kiem Jun 23 '19 at 12:36

1 Answer


A very simple approach to creating a data.frame with a few outliers:

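# Create a vector of normally distributed values with a few outliers
# N        - number of random values
# num.out  - number of values to turn into outliers
# mean, sd - parameters passed on to rnorm()
my.rnorm <- function(N, num.out, mean = 0, sd = 1) {
  x <- rnorm(N, mean = mean, sd = sd)
  # Pick num.out distinct positions to contaminate
  ind <- sample.int(N, num.out, replace = FALSE)
  # Push each chosen value 3*sd further away from the mean, keeping its side
  dev <- x[ind] - mean
  x[ind] <- mean + (abs(dev) + 3 * sd) * ifelse(dev < 0, -1, 1)
  x
}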

N <- 60
num.out <- 4
df <- data.frame(col1 = my.rnorm(N, num.out),
                 col2 = my.rnorm(N, num.out),
                 col3 = my.rnorm(N, num.out),
                 col4 = my.rnorm(N, num.out),
                 col5 = my.rnorm(N, num.out))

Please note that I used mean=0 and sd=1, because the values mean=1, sd=0 that you provided in the comments do not make much sense: with sd=0 every generated value would simply equal the mean.

The above approach does not guarantee that each column contains exactly 4 outliers. In practice there will be at least 4, but in some rare cases there could be more, because rnorm() can itself produce values that lie beyond the 1.5 IQR limits.
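To see how many outliers a generated vector really contains, you can count the points that fall outside the 1.5 IQR fences. The following is only a minimal sketch: count_outliers is a hypothetical helper name, the quartiles come from quantile() with its default settings (boxplot() uses hinges, which can differ slightly), and the seed is arbitrary.

```r
# Count the values lying more than 1.5 * IQR below Q1 or above Q3.
# count_outliers is a hypothetical helper, not part of base R.
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}

set.seed(42)  # arbitrary seed, only for reproducibility

# A contaminated vector built the same way as above (mean = 0, sd = 1)
x <- rnorm(60)
ind <- sample.int(60, 4)
x[ind] <- (abs(x[ind]) + 3) * ifelse(x[ind] < 0, -1, 1)

count_outliers(x)
```

If you need exactly 4 outliers, one option is to regenerate a vector whenever this count is not 4.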

Another note is that data.frames might not be the best objects to store numeric values. If all your 1000 data.frames are numeric, it is better to store them in matrices.

Depending on the final goal and the type of the object you store your data in (list, data.frame or matrix) there are faster ways to create 1000 objects filled with random values.
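As one possible illustration, the sketch below builds a list of 1000 numeric matrices (60 rows, 5 columns) with replicate(). It assumes 4 shifted values per column, as in the data.frame example above, with mean = 0 and sd = 1. The names contaminated_rnorm, make_set and sets are made up for this example, and the helper repeats the contamination logic so the block runs on its own.

```r
# Hypothetical helper: normal values with num.out of them pushed
# 3*sd further away from the mean (same idea as my.rnorm above).
contaminated_rnorm <- function(N, num.out, mean = 0, sd = 1) {
  x <- rnorm(N, mean = mean, sd = sd)
  ind <- sample.int(N, num.out)
  dev <- x[ind] - mean
  x[ind] <- mean + (abs(dev) + 3 * sd) * ifelse(dev < 0, -1, 1)
  x
}

set.seed(1)  # arbitrary seed, only for reproducibility

n_sets  <- 1000  # number of simulated sets
N       <- 60    # rows per set
n_cols  <- 5     # columns per set
num.out <- 4     # shifted values per column (assumption)

# One set: replicate() binds the 5 generated vectors into a 60 x 5 matrix
make_set <- function() replicate(n_cols, contaminated_rnorm(N, num.out))

# All sets: simplify = FALSE keeps them as a list of matrices
sets <- replicate(n_sets, make_set(), simplify = FALSE)

length(sets)
dim(sets[[1]])
```

A list keeps each set separate, so you can process them with lapply() or sapply(); if you really need data.frames, as.data.frame(sets[[1]]) converts a single set.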

Katia