
Consider the following example data, stored in a dataframe called df:

df
x  y
2  4
1  5
0  8

As you can see, this dataframe has 3 rows. What I'd like to do is sample 100 rows from it, where each row has an equal probability of being selected (in this case 1/3). My output, let's call it df_result, would look something like this:

df_result
x  y
0  8
2  4
0  8
1  5
1  5
2  4

and so on, until 100 rows have been sampled.

I saw a previous Stack Overflow post that showed how to take a random sample of rows from a dataframe: `df[sample(nrow(df), 3), ]`

However, when I tried to sample 100 rows this way, it (predictably) did not work, and it did not let me assign the sampling probabilities.

Any tips?

Thanks

Neil Lunn
lecreprays
  • `df[sample(nrow(df),100,replace=TRUE),]` – HubertL May 19 '17 at 02:16
  • @HubertL Thanks. When I try to set the `prob=c(rep(1/3,3))` argument in the sample function I get the error: "incorrect number of probabilities". Does the sample function automatically assign equal weights? – lecreprays May 19 '17 at 02:22
  • I'm not sure why... it works with `df[sample(3,100,replace=TRUE,prob=c(rep(1/3,3))),]` – HubertL May 19 '17 at 02:27
  • `modelr::resample` (e.g. `modelr::resample(df, sample(nrow(df), 100, replace = TRUE))`) is good for this at scale, as it just stores a pointer and the indices instead of redundant data. To expand it to a data.frame pass it to `as.data.frame`, though models can handle it directly. – alistaire May 19 '17 at 04:05
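
On the `prob` question raised in the comments: `sample` draws with equal weights when `prob` is left out, and it reports "incorrect number of probabilities" when `prob` has a different length than the set being sampled. The sketch below assumes the three-row `df` from the question; the wrong-length `prob` is only one plausible way to reproduce the error, not necessarily what happened in the asker's session.

```r
# Example data from the question
df <- data.frame(x = c(2, 1, 0), y = c(4, 5, 8))

# prob needs one weight per element being sampled (here, one per row).
# Omitting prob entirely gives the same equal weighting.
idx <- sample(nrow(df), 100, replace = TRUE, prob = rep(1/3, nrow(df)))
df_result <- df[idx, ]
nrow(df_result)

# Assumption for illustration: a prob vector of length 2 for 3 rows
# triggers the error; tryCatch captures the message instead of stopping.
res <- tryCatch(
  sample(nrow(df), 100, replace = TRUE, prob = rep(1/3, 2)),
  error = function(e) conditionMessage(e)
)
res
```

Building the weights with `rep(1/3, nrow(df))` rather than a hard-coded length keeps `prob` in step with the number of rows.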

1 Answer

df <- read.table(header = TRUE,
                text = "x  y
2  4
1  5
0  8")

set.seed(1)  # make the sample reproducible
# draw 10 row indices with replacement, then select those rows
df[sample(nrow(df), 10, replace = TRUE), ]

    x y
1   2 4
2   1 5
2.1 1 5
3   0 8
1.1 2 4
3.1 0 8
3.2 0 8
2.2 1 5
2.3 1 5
1.2 2 4
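
The same call scales to the 100 rows asked for in the question. Without `replace = TRUE`, `sample` cannot draw more items than the population holds, which is why the original attempt failed. The following is a minimal sketch, not a definitive implementation; resetting the row names is an optional extra step that removes the `2.1`, `3.2`, ... suffixes R adds to duplicated rows.

```r
df <- read.table(header = TRUE, text = "x  y
2  4
1  5
0  8")

set.seed(1)
# 100 row indices, each row equally likely on every draw
df_result <- df[sample(nrow(df), 100, replace = TRUE), ]

# Optional: replace the suffixed row names with plain 1..100
rownames(df_result) <- NULL

dim(df_result)
# Empirical share of each original row; should be roughly 1/3 each
table(df_result$x) / nrow(df_result)
```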
Jeppe Olsen