Random sampling with replace = FALSE results in duplicates being selected in R

Question

I need to split about 70 rows of data into 7 random data frames without any row being selected more than once. I have used replace = FALSE, but duplicate rows are still picked. The results are the same with the sample_n() function.

Is this a known bug?

How can this be remedied? For future requirements, selecting the rows manually would be an arduous job.

name = c("arjun","Andrea", "Biswas","Ann","Biju", "Sheela","Deepti","Betty", "Hema", "Gowri"," Kunal", "Anamika","Ashik", "Hina","Kiran" )

gender = c("M","F", "M","M","F", "M","F","F","F", "F","M","F", "F","F","M")

df = data.frame(name, gender)

And so on, likewise with 5 additional columns. Each group needs two females, with the rest males. But even the basic split produces duplicates across groups: for example, arjun is in groups 1 and 3, and Andrea is in groups 2 and 3, which should not happen.

Code I tried:

library(dplyr)
L4 = list()
dfc = list()
f = list()

numzone <- c(1:5)

for (i in numzone) {
  # draw 3 random rows for group i
  L4 <- df[sample(nrow(df), size = 3, replace = FALSE), ]

  # intended name of the data frame for this group
  f <- paste("df", i, sep = "")

  dfc <- L4

  if (i %in% c(1:5)) {
    f <- dfc[]
  }

  print(f)
}

Additionally, I need these separated rows to be assigned to dynamically named data frames, perhaps with names taken from a predefined list.

Thanks for the update. I tried to run it, but pool gives an error: no pool defined. — Satish Kumar, Jul 12 '21 at 04:25
dfn should be replaced with df. Please also include some more data, in a multiple of 7 (say around 14 or 28 rows), e.g. df = name = c("arjun","Andrea", "Biswas", "alok","Ajay", "Biinus" ) gender = c("M","F", "M") — Satish Kumar, Jul 13 '21 at 14:00
Have you seen this? https://stackoverflow.com/questions/37145863/splitting-a-data-frame-into-equal-parts — Skaqqs, Jul 13 '21 at 23:45

Skaqqs · Answer 1 · 2021-07-10T14:34:11.767

I don't think you found a bug, but it is hard to say without example data and the code you are using. I'd wager that you are calling sample() (or a similar function) more than once. Even with replace=FALSE, separate calls can return the same value, because each call draws from the full data independently. In other words, replace=FALSE only prevents duplicates within a single call to sample().
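To illustrate the difference, here is a minimal sketch in base R. The pool of 15 values and the group size of 3 are assumptions chosen to mirror the loop in the question, and the seed is arbitrary. It compares five independent draws with a single draw that is split afterwards.

```r
set.seed(1)  # arbitrary seed, only to make the sketch repeatable

pool <- 1:15

# Five independent draws of 3 values each.
# replace = FALSE guarantees uniqueness within each draw, not across draws.
independent <- lapply(1:5, function(i) sample(pool, size = 3, replace = FALSE))
anyDuplicated(unlist(independent)) > 0  # very likely TRUE: values repeat across draws

# One draw of the whole pool, then cut into five groups of 3.
# Every value is used exactly once, so no group shares a value with another.
shuffled <- sample(pool, size = length(pool), replace = FALSE)
groups <- split(shuffled, rep(1:5, each = 3))
anyDuplicated(unlist(groups)) > 0  # FALSE
```

The second pattern is the one to aim for: sample once, then partition the result.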

If you'd like more specific advice, feel free to edit your question with an example dataset and the code you are using.

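Here is a general approach to this problem: draw a single sample of the whole pool without replacement, then split it into groups. See the documentation for sample() and split() for more information on sampling and splitting.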

# values to be randomly sampled
pool <- 1:70

# dataframe to receive random samples
dat <- data.frame(matrix(ncol = 7, nrow = 0))

# take random samples, split into 7 groups
split(sample(x = pool, size = 70, replace = FALSE), f = ceiling(seq_along(pool)/10))

# rbind to other data
rbind(dat, split(sample(x = pool, size = 70, replace = FALSE), f = ceiling(seq_along(pool)/10)))
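
The same idea can be applied to the rows of a data frame. The following is a sketch under assumptions, not a definitive solution: the data frame is rebuilt from the names and genders in the question, n_groups is a hypothetical setting, and the df1 ... df5 names are only illustrative. It shuffles the row indices once and stores the resulting groups in a named list, and it does not attempt to balance gender across groups.

```r
# Example data rebuilt from the question (assumed to be a data frame)
df <- data.frame(
  name = c("arjun", "Andrea", "Biswas", "Ann", "Biju", "Sheela", "Deepti",
           "Betty", "Hema", "Gowri", "Kunal", "Anamika", "Ashik", "Hina", "Kiran"),
  gender = c("M", "F", "M", "M", "F", "M", "F", "F", "F", "F", "M", "F", "F", "F", "M")
)

n_groups <- 5  # hypothetical number of groups

# Shuffle all row indices once; each row appears exactly once
shuffled <- sample(nrow(df))

# Group label for each shuffled row: 1, 2, ..., n_groups, repeated as needed
group_id <- rep(seq_len(n_groups), length.out = nrow(df))

# split() on a data frame returns a list of data frames, one per group
groups <- split(df[shuffled, ], group_id)
names(groups) <- paste0("df", seq_len(n_groups))

groups$df1  # access one group by name, or use groups[["df1"]]
```

Keeping the groups in a named list is usually easier to work with than creating separate objects such as df1, df2, and so on with assign(), because the list can be looped over with lapply().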
