I have a question about how to sample a data frame. My dataset has 114 rows and 9 columns. I need to extract 3 subsets of 38 rows each (114 / 3).

I have this script, but it doesn't work for the last subset:

install.packages("Rcmdr")
library(Rcmdr)

ana <- read.delim("~/Desktop/ana", header=TRUE, dec=",")

set1 <- ana[sample(nrow(ana), 38), ]
set1.index <- as.numeric(rownames(set1))

ana2 <- ana[(-set1.index),]
set2 <- ana2[sample(nrow(ana2), 38), ]
set2.index <- as.numeric(rownames(set2)) 

ana3 <- ana2[(-set2.index),]
ana3

For set1 and set2 I get the subsets correctly, but for the third subset (ana3) I get 50 rows (or fewer).

(Thank you in advance! =) )

user3868641
  • Welcome to StackOverflow! Please read about [how to provide your data in a reproducible format](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so that it's easier for us to help you. Supplying your data as an image makes it difficult to quickly work with your example code, so consider constructing a fake dataset and supplying it using `dput`. – Thomas Feb 22 '15 at 21:23
  • It seems that you want each row to be in exactly one of the subsets. Here's an example of how you could do it using the iris data set: `set_number <- sample(1:3, nrow(iris), replace = TRUE); set1 <- iris[set_number == 1, ]; set2 <- iris[set_number == 2, ]; set3 <- iris[set_number == 3, ]` or `my_sets <- split(iris, set_number)` – talat Feb 22 '15 at 21:42

1 Answer

In general, @docendodiscimus gives valid advice, but the sampling code in that comment does not guarantee equal numbers in the subsets (see below). Try this instead:

 set.seed(123)  # best to set a seed to allow reproducibility
 sampidx <- sample(rep(1:3, each = 38))  # shuffle 38 copies each of 1, 2 and 3
 set1 <- ana[sampidx == 1, ]  # logical indexing of dataframe
 set2 <- ana[sampidx == 2, ]
 set3 <- ana[sampidx == 3, ]
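
The same idea can be sketched for any number of rows and groups. This is only an illustration under stated assumptions: `ana_demo` is a made-up stand-in for the asker's `ana`, and `k`, `group` and `sets` are names introduced here, not anything from the question.

```r
set.seed(123)  # fixed seed so the split is repeatable

# Hypothetical stand-in for `ana`: 114 rows, 2 columns
ana_demo <- data.frame(id = 1:114, value = rnorm(114))

k <- 3  # number of subsets wanted

# rep_len() recycles 1..k to exactly nrow(ana_demo) labels, so group sizes
# differ by at most one even when the row count is not divisible by k;
# sample() then shuffles the labels without replacement
group <- sample(rep_len(seq_len(k), nrow(ana_demo)))

# split() returns a named list of data frames, one per group label
sets <- split(ana_demo, group)
sizes <- sapply(sets, nrow)
print(sizes)
```

Here `sets[["1"]]`, `sets[["2"]]` and `sets[["3"]]` play the role of `set1`, `set2` and `set3`; keeping them in a list avoids repeating the subsetting line once per group.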

Sampling with replacement does not give equal-sized groups:

> table( sample(1:3, nrow(iris), replace = TRUE) ) 
 1  2  3 
52 52 46 
> table( sample(1:3, nrow(iris), replace = TRUE) )

 1  2  3 
51 49 50    # notice that it also varies from draw to draw
> table(sampidx)
sampidx
 1  2  3 
38 38 38 
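
As for why the script in the question goes wrong at the third subset: `ana2` keeps the original row names of `ana`, but a negative numeric index drops rows by position, not by name. So `ana2[-set2.index, ]` removes whichever rows happen to sit at those positions, and positions beyond `nrow(ana2)` are silently ignored. A possible fix for the step-by-step approach is to match on row names instead. The sketch below assumes a made-up data frame `ana_demo` in place of `ana`.

```r
set.seed(42)  # fixed seed so the draw is repeatable

# Hypothetical stand-in for `ana`: 114 rows, default row names "1".."114"
ana_demo <- data.frame(id = 1:114, value = rnorm(114))

set1 <- ana_demo[sample(nrow(ana_demo), 38), ]
# Drop the sampled rows by row name rather than by position
ana2 <- ana_demo[!(rownames(ana_demo) %in% rownames(set1)), ]

set2 <- ana2[sample(nrow(ana2), 38), ]
# ana2 still carries the original row names, so match on names here too
set3 <- ana2[!(rownames(ana2) %in% rownames(set2)), ]

print(c(nrow(set1), nrow(set2), nrow(set3)))
```

With name-based matching, each of the three subsets ends up with 38 rows and no row appears twice.
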
IRTFM
  • Thank you very much, BondedDust, that's exactly what I was looking for! Thank you too, docendodiscimus! =) – user3868641 Feb 23 '15 at 20:03