I want to create a list of 10 artificial dataframes that keep the column means of my original dataframe but have an artificially increased standard deviation. Each artificial dataframe should have the same column names and the same number of rows as the original dataset. Each column should have the same expected mean as the corresponding column in the original dataframe, while its expected standard deviation is increased by x% relative to the standard deviation of the original column, where x takes the values in seq(10, 100, 10). So the first artificial dataframe has expected standard deviations increased by 10%, the second by 20%, and so on up to 100%.

Based on this other post, "Generate random numbers with fixed mean and sd", this is my attempt so far:

# First, define a function to draw random numbers given specific values of n, mean and sd

rnorm2 <- function(n, mean, sd) { mean + sd * scale(rnorm(n)) }

# Create a random df for replicability
df = data.frame(replicate(9, sample(0:1, 100, replace = TRUE)))

names(df) = c("a", "b", "c", "d",
              "e", "f", "g",
              "h", "i")

# Compute and store the mean, standard deviation and size of each column in my dataset

# First, initialize vectors

columns = c("a", "b", "c", "d",
            "e", "f", "g",
            "h", "i")

ncols = length(df)

column_means <- vector(mode = "numeric", length = ncols)
column_sd <- vector(mode = "numeric", length = ncols)

# Now loop through each column to obtain the mean, the standard deviation, and the standard deviation increased by x%

xs = seq(10, 100, 10)

for (x in seq_along(xs)) {
  for (i in seq_along(columns)) {
    column_means[i] <- mean(df[[i]], na.rm = TRUE)
    column_sd[i] <- sd(df[[i]], na.rm = TRUE)
    # xs[x] is the percentage increase; x is only its position in xs
    column_sd_new[[i]][x] <- column_sd[i] + ((xs[x] / 100) * column_sd[i])
  }
}

However, this gives me the following error:

Error in `*tmp*`[[i]] : subscript out of bounds
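
A note on the error, offered as a likely reading rather than a confirmed diagnosis: the code above never creates `column_sd_new` before assigning into `column_sd_new[[i]][x]`. A "subscript out of bounds" message from `*tmp*`[[i]] is what R reports when the indexed object exists but has no element i, for example an empty list left over in the session. One way around it, sketched below on the assumption that the goal at this step is just a table of increased standard deviations, is to compute them all at once with `outer()`. The seed, the `letters[1:9]` column names and the `plus_..pct` labels are illustrative choices, not part of the original code.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible

# Same toy data as in the question: 9 binary columns, 100 rows
df <- data.frame(replicate(9, sample(0:1, 100, replace = TRUE)))
names(df) <- letters[1:9]

xs <- seq(10, 100, 10)  # percentage increases

# Per-column mean and standard deviation, without an explicit loop
column_means <- sapply(df, mean, na.rm = TRUE)
column_sd    <- sapply(df, sd, na.rm = TRUE)

# 9 x 10 matrix: rows are the columns of df, columns are the increases in xs
column_sd_new <- outer(column_sd, 1 + xs / 100)
colnames(column_sd_new) <- paste0("plus_", xs, "pct")
```

With this layout, `column_sd_new["a", "plus_20pct"]` would be the standard deviation of column a increased by 20%.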

Also, I cannot find a way to apply the rnorm2 function to obtain a list of 10 artificial dataframes.

Any help would be greatly appreciated!

txz10001
  • Keep in mind that it is a bad habit to create objects with the same name as source functions. – Yacine Hajji Oct 21 '22 at 14:59
  • "this does not work" is not very helpful. In what way does it not work? Does it give an error? Does it run without error but not produce the result you want? Also bear in mind that sampling from a population and constructing a dataset with given characteristics are two very different things. Do you require your artificial data frames to have *exactly* the specified mean and variance, or merely for their *expected* mean and variance to be as specified? If you are sampling from a population, I fear the former will be impossible. – Limey Oct 21 '22 at 15:06
  • Thank you, I edited based on your comments. It gives me the following error: Error in `*tmp*`[[i]] : subscript out of bounds. Also, I don't know how to apply the rnorm2 function to obtain my final list of datasets. I want my artificial dataset to have the expected mean and variance as specified. – txz10001 Oct 21 '22 at 15:11
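
Given the clarification above that only the expected mean and standard deviation need to match, here is a hedged sketch of one possible way to build the list. It draws each column with plain `rnorm()`, whose `mean` and `sd` arguments are the expected values. `rnorm2()`, by contrast, rescales each sample so that its sample mean and standard deviation equal the requested values exactly. The helper name `make_inflated_df`, the seed and the list names are hypothetical. Note also that the draws are continuous normal values, not 0/1 like the toy data.

```r
set.seed(42)  # assumption: fixed seed so the example is reproducible

# Same toy data as in the question
df <- data.frame(replicate(9, sample(0:1, 100, replace = TRUE)))
names(df) <- letters[1:9]

xs <- seq(10, 100, 10)  # percentage increases in the standard deviation

# Hypothetical helper: one artificial data frame for a given increase `pct`.
# Every column is drawn from a normal distribution with the original column's
# mean and its standard deviation scaled by (1 + pct / 100).
make_inflated_df <- function(data, pct) {
  as.data.frame(lapply(data, function(col) {
    rnorm(length(col),
          mean = mean(col, na.rm = TRUE),
          sd   = sd(col, na.rm = TRUE) * (1 + pct / 100))
  }))
}

# List of 10 data frames, one per element of xs
artificial_dfs <- lapply(xs, make_inflated_df, data = df)
names(artificial_dfs) <- paste0("sd_plus_", xs, "pct")
```

Here `artificial_dfs[["sd_plus_10pct"]]` is the data frame for the 10% increase. If exact sample moments were wanted instead, the `rnorm()` call could be swapped for `as.numeric(rnorm2(...))`, since `scale()` returns a one-column matrix.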
