Sampling randomly from an array of dataframes

Question

I have created multiple dataframes based on various conditions. Now I would like to sample from these dataframes, removing each row once it has been sampled. I have tried dplyr's sample_n:

sample_n(df, 4)

The problem is that this does not remove the rows. Would I need some recursive loop that removes the rows once they are sampled, or is there a handy function that can help?

Please provide a small reproducible example and expected output — akrun, Jan 17 '17 at 08:40
Have a look at the `modelr` package for the tidyverse approach. — Axeman, Jan 17 '17 at 09:06
@akrun the same question was not asked. I did not merely want to sample the dataframe, I need to not sample the same data again once I sample subsequent times. — Lowpar, Jan 17 '17 at 09:53
I think it could be categorized as a general dupe as was done [here](http://stackoverflow.com/questions/41689941/moving-variablecolumns-to-column-name-vertical-to-horizontal-in-r/41689954#41689954). Anyway, I am reopening it if you find it objectionable — akrun, Jan 17 '17 at 09:56
@akrun, I think it was my fault for not elaborating on the title correctly. — Lowpar, Jan 17 '17 at 10:01

score 6 · Accepted Answer · answered Jan 17 '17 at 08:45

Works for me.

# generate data
a <- data.frame(letters = letters[1:5], var = rnorm(5))
b <- data.frame(letters = letters[6:10], var = rnorm(5))
c <- data.frame(letters = letters[11:15], var = rnorm(5))
xy <- list(a, b, c)

set.seed(357) # set seed for reproducibility
dfsample <- sample(seq_len(length(xy)), 1) # sample out one data.frame

xy[[dfsample]]

  letters         var
1       a  1.51348192
2       b -0.60657737
3       c  0.51828252
4       d -0.05352487
5       e -1.34303266

# remove one random row; note the minus sign in front of sample()
# seq_len() is used instead of 1:nrow() so an empty data.frame does not yield c(1, 0)
xy[[dfsample]] <- xy[[dfsample]][-sample(seq_len(nrow(xy[[dfsample]])), 1), ]
xy[[dfsample]]

  letters         var
2       b -0.60657737
3       c  0.51828252
4       d -0.05352487
5       e -1.34303266
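
The same idea can be wrapped in a small function so that repeated draws never return a row twice. This is only a sketch: `take_rows` and `pool` are hypothetical names, not part of the answer above or of any package, and the function assumes you want the sampled rows and the leftover rows returned together.

```r
# Hypothetical helper: draw n rows without replacement and return
# both the drawn rows and the rows that are left over.
take_rows <- function(df, n) {
  n <- min(n, nrow(df))                          # never ask for more rows than remain
  idx <- sample.int(nrow(df), n)                 # row positions to draw
  keep <- setdiff(seq_len(nrow(df)), idx)        # positions that stay in the pool
  list(sampled = df[idx, , drop = FALSE],
       remaining = df[keep, , drop = FALSE])
}

set.seed(1)  # arbitrary seed, for reproducibility only
pool <- data.frame(letters = letters[1:5], var = rnorm(5))

first  <- take_rows(pool, 2)
second <- take_rows(first$remaining, 2)  # draws only from rows not taken by `first`
```

Using `setdiff()` rather than a negative index avoids the pitfall where `df[-integer(0), ]` returns zero rows when nothing was sampled. For a list of dataframes such as `xy`, the leftover rows could be written back with `xy[[i]] <- result$remaining` after each draw.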

Roman Luštrik

Any particular reason for `seq_len(length(xy))` instead of `seq_along(xy)`? – Axeman Jan 17 '17 at 08:54
Great response, indeed this could be implemented into a recursive function, thank you very much for your help! – Lowpar Jan 17 '17 at 10:10
@Axeman I rarely pay any attention to that part, so no particular reason. – Roman Luštrik Jan 17 '17 at 13:52

score 0 · Answer 2 · answered Jan 17 '17 at 11:53

modelr::crossv_mc(mtcars, 5, 0.5)

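creates 5 random train/test splits. Within each split the train and test sets are mutually exclusive and, with the test proportion set to 0.5, of equal size. Note that the 5 splits are drawn independently, so a row can appear in the test set of more than one split. The splits are stored as list columns of `resample` objects, which are memory efficient because each one holds only row indices and a pointer to the original data.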

# A tibble: 5 × 3
           train           test   .id
          <list>         <list> <chr>
1 <S3: resample> <S3: resample>     1
2 <S3: resample> <S3: resample>     2
3 <S3: resample> <S3: resample>     3
4 <S3: resample> <S3: resample>     4
5 <S3: resample> <S3: resample>     5
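
If the goal is several samples that never share a row, a base R alternative (not the `modelr` approach shown above) is to shuffle the rows once and cut the result into chunks. A minimal sketch, assuming 4 chunks are wanted; the names `shuffled`, `chunk_id` and `chunks` are illustrative:

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Permute all rows once, then deal them out into 4 groups.
shuffled <- mtcars[sample(nrow(mtcars)), ]
chunk_id <- rep(1:4, length.out = nrow(shuffled))
chunks   <- split(shuffled, chunk_id)  # list of 4 data frames with no rows in common
```

Because every row is assigned to exactly one chunk, taking `chunks[[1]]`, then `chunks[[2]]`, and so on gives successive samples without repeats.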
