How to randomly subset a dataset multiple times in R?

Question

I have a dataset with 5 columns and 24347 observations. I want to generate 10 random datasets from the master dataset. I am using the following code, but I am unable to generate multiple datasets.

iterations =10
variables = 5

output_i <- matrix(ncol=variables, nrow=iterations)

for(i in 1:iterations){
    output_i <- newdata[sample(nrow(newdata), 100),]  
}

Random subset of rows, columns, both? Do you want them to be mutually exclusive? Equally sized? Randomly sized? — NewNameStat, Jul 01 '16 at 13:42

score 3 · Answer 1 · answered Jul 01 '16 at 13:42


Use a list instead. In your code, output_i is overwritten on every pass of the loop, so only the last sample survives.

output <- list()

for(i in 1:iterations){
    output[[i]] <- newdata[sample(nrow(newdata), 100),]  
}

Your first sample will be the first element of the list, output[[1]], the second will be output[[2]], and so on.
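A minimal, self-contained sketch of the same idea. The `newdata` built here is only a stand-in with hypothetical values, since the real data is not shown in the question. Pre-allocating the list with `vector("list", iterations)` and calling `set.seed()` are optional additions, not part of the answer above.

```r
# Stand-in for the asker's data: 5 columns, 24347 rows (hypothetical values)
newdata <- as.data.frame(matrix(rnorm(5 * 24347), ncol = 5))

iterations <- 10
set.seed(42)  # optional: makes the random draws reproducible

# Pre-allocate a list with one slot per sample
output <- vector("list", iterations)

for (i in seq_len(iterations)) {
  # sample() draws 100 distinct row indices (no replacement by default)
  output[[i]] <- newdata[sample(nrow(newdata), 100), ]
}

length(output)     # number of samples stored
dim(output[[1]])   # rows and columns of the first sample
```

Each element of `output` is an independent draw, so the same row can appear in more than one sample.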

answered Jul 01 '16 at 13:42

Choubi


Just for completeness and even though this is not a very good R practice, the following seems closer to what you had in mind (creating output_1, output_2... as different objects). Inside your loop, use the assign function: assign(paste0("output_", i), newdata[...]) – Choubi Jul 01 '16 at 13:45

score 2 · Answer 2 · answered Jul 01 '16 at 14:43


A more "R" way to do this is to ditch the for loop in favour of lapply:

sample_data_list <- lapply(1:iterations, function(i) newdata[sample(1:nrow(newdata), 100),])
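For a version that runs on its own, here is a sketch with a stand-in `newdata` (hypothetical values). The element names and the stacked data frame are optional extras added for illustration; `sample_id` and `stacked` are made-up names.

```r
# Stand-in for the asker's data (hypothetical values)
newdata <- data.frame(matrix(rnorm(5 * 24347), ncol = 5))
iterations <- 10

# seq_len() avoids the 1:0 pitfall if iterations is ever 0
sample_data_list <- lapply(seq_len(iterations), function(i) {
  newdata[sample(nrow(newdata), 100), ]
})

# Optional: name the elements so each sample can be referred to by label
names(sample_data_list) <- paste0("sample_", seq_len(iterations))

# Optional: stack all samples into one data frame with an id column
stacked <- do.call(
  rbind,
  Map(function(d, id) cbind(sample_id = id, d),
      sample_data_list, names(sample_data_list))
)

dim(stacked)  # all rows from every sample, plus the id column
```

Keeping the samples in a list makes it easy to apply the same function to every sample later with `lapply()` or `sapply()`.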

answered Jul 01 '16 at 14:43

dww


score 0 · Answer 3 · answered Jul 01 '16 at 13:49

Looping over i and assigning to a variable called output_i does not substitute the value of i into the name. R treats output_i as one fixed name, so every iteration overwrites the same object.

I suggest that you use a list to hold the output_i objects.

See code below:

iterations <- 10

newdata <- matrix(1:(5*24347),ncol=5, nrow=24347)

sample_data_list <- list()

for(i in 1:iterations){
  sample_data_list[[i]] <- newdata[sample(1:nrow(newdata), 100),]  
}

This will generate a list of 10 different samples of 100 observations from the original data.

> str(sample_data_list)
List of 10
 $ : int [1:100, 1:5] 8788 21165 14054 2762 10288 3319 8175 6494 17935 2865 ...
 $ : int [1:100, 1:5] 16351 15621 5455 23679 22460 4283 15251 1008 21474 19218 ...
 $ : int [1:100, 1:5] 16814 21784 9937 5673 8699 7887 23739 3382 429 2550 ...
 $ : int [1:100, 1:5] 21479 12247 8417 7963 14565 4513 3461 10996 16986 8029 ...
 $ : int [1:100, 1:5] 22685 18552 21278 17930 954 9223 17894 343 4677 15571 ...
 $ : int [1:100, 1:5] 13486 3516 5155 1617 16324 15705 12960 12154 20426 1124 ...
 $ : int [1:100, 1:5] 10118 56 2950 12234 953 9479 11098 14272 24303 7672 ...
 $ : int [1:100, 1:5] 1621 12303 14894 718 20877 1682 16234 7019 7926 11954 ...
 $ : int [1:100, 1:5] 915 2957 14657 21297 13652 6750 11996 3621 23321 21818 ...
 $ : int [1:100, 1:5] 11654 20698 5739 6693 6840 10384 20068 10571 18353 5123 ...

score 0 · Answer 4 · edited May 23 '17 at 11:44

I think your best bet is to build a list of data frames with replicate() rather than with a for loop. replicate() is a wrapper around sapply(), which in turn calls lapply().

First, let's create a dummy data frame df that mimics your data, with 5 columns and 24,347 observations:

df <- data.frame(a = rnorm(24347),
                 b = rnorm(24347),
                 c = rnorm(24347),
                 d = rnorm(24347),
                 e = rnorm(24347))

Next, set the number of iterations you want, and how big each subset sample should be:

iterations <- 10
subset_size <- 100

Finally, create a list of sampled data frames:

samples_list <- replicate(n = iterations,
                          expr = {df[sample(nrow(df), subset_size), ]},
                          simplify = FALSE)

This repeats the expression df[sample(nrow(df), subset_size),] for however many iterations you desire and places each newly created data frame in the list samples_list.

You access the data frames just like you would access any other list element:

samples_list[[1]]

Remember to use double brackets to extract a data frame: single brackets, as in samples_list[1], return a list of length one rather than the data frame itself. From here, you can access any particular row or column as normal:

samples_list[[dataframe]][row,column]
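To make the bracket behaviour concrete, here is a small sketch that rebuilds the dummy data and the list, then indexes into it. The names `first_sample` and `wrapped` are illustrative only.

```r
# Dummy data mirroring the answer: 5 numeric columns, 24347 rows
df <- data.frame(a = rnorm(24347), b = rnorm(24347), c = rnorm(24347),
                 d = rnorm(24347), e = rnorm(24347))

samples_list <- replicate(n = 10,
                          expr = df[sample(nrow(df), 100), ],
                          simplify = FALSE)

first_sample <- samples_list[[1]]  # the data frame itself
wrapped      <- samples_list[1]    # a list of length 1 holding that data frame

class(first_sample)  # "data.frame"
class(wrapped)       # "list"

# Row 3, column "b" of the second sample
samples_list[[2]][3, "b"]

# Mean of column "a" in each of the 10 samples
sapply(samples_list, function(s) mean(s$a))
```

The last line shows the main payoff of storing samples in a list: one call summarises every sample at once.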

If you need more info on lists, I would head over to this post: https://stackoverflow.com/a/24376207/6535514
