I have just started learning R using RStudio and I have some, perhaps basic, questions. One of them concerns the sample function. My dataset consists of 402224 observations of 147 variables. My task is to take a sample of 50 observations and then produce a data frame from it, and so on. But when I run y = sample(mydata, 50, replace = TRUE, prob = NULL), the result is a dataset with 402224 observations of 50 variables. That is, the sampling is done on the variables and not on the observations.

Do you have any idea why this happens? Thank you in advance.

Neil Lunn
serpiko
  • You can see working sampling code at http://stackoverflow.com/q/8273313/3093387. As you will see in the answers, you need to use row indexing. – josliber May 17 '17 at 16:26

3 Answers

2

If you want to create a data frame of 50 observations with replacement from your data frame, you can try:

mydata[sample(nrow(mydata), 50, replace=TRUE), ]

Alternatively, you can use the sample_n function from the dplyr package:

library(dplyr)
sample_n(mydata, 50, replace = TRUE)  # replace defaults to FALSE
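
As a quick sanity check of the row-indexing approach, here is a minimal sketch in base R on a small made-up data frame (toy and its columns are hypothetical, not the asker's data):

```r
# Hypothetical toy data frame standing in for mydata
toy <- data.frame(id = 1:10, value = letters[1:10])

set.seed(1)  # only so that the draw is reproducible

# Sample 5 row indices with replacement, then use them to index rows
idx <- sample(nrow(toy), 5, replace = TRUE)
rows_sample <- toy[idx, ]

dim(rows_sample)  # 5 rows, 2 columns: every variable is kept
```

The key point is that sample() is given the number of rows, not the data frame itself, so it returns row positions that are then used before the comma.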
fmic_
1

The other answers select rows, but it looks like you are after columns. You can still accomplish this in a similar way.

Here's a sample df.

df = data.frame(a = 1:5, b = 6:10, c = 11:15)
> df
  a  b  c
1 1  6 11
2 2  7 12
3 3  8 13
4 4  9 14
5 5 10 15

Then, to randomly select 2 columns and keep all observations, we could do this:

> df[ , sample(1:ncol(df), 2)]
   c a
1 11 1
2 12 2
3 13 3
4 14 4
5 15 5

So, what you'll want to do is something like this:

y = mydata[ , sample(1:ncol(mydata), 50)]
Kristofersen
0

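That is because sample works on vectors, and a data frame is treated as a list of columns, so the columns are what get sampled. Try the following:

 library(data.table)
 set.seed(10)                        # make the draw reproducible
 df_sample <- data.table(mydata)     # convert the data frame to a data.table
 df_sample[sample(.N, 50)]           # .N is the number of rows; draw 50 of them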
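
To see why the original call sampled variables, a small base R sketch may help (toy is a made-up data frame, used here only for illustration). A data frame is a list of columns, so length() counts columns and sample() draws from them:

```r
# Hypothetical toy data frame: 5 rows, 3 columns
toy <- data.frame(a = 1:5, b = 6:10, c = 11:15)

length(toy)   # 3: the number of columns, not rows
is.list(toy)  # TRUE

set.seed(1)
# Passing the data frame itself samples its elements, i.e. columns
cols_sample <- sample(toy, 2)
dim(cols_sample)  # 5 rows, 2 columns

# Passing the row count and indexing before the comma samples rows
rows_sample <- toy[sample(nrow(toy), 2), ]
dim(rows_sample)  # 2 rows, 3 columns
```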
amonk