Sampling from an R Dataframe

Question

I have a dataframe with various real estate listings similar to the following.

ADDRESS      PRICE     ZIP     ...
123 Main St  400,000   45678
23 Green Ln  380,000   45670
29 Green Ln  385,000   45670
...

I want to perform a stratified random sample for a testing dataset. In other words, I want to take ~30% of the entries from each ZIP code and separate them into a new dataset. I am not familiar with R dataframes, so how would I perform such an operation?

I've used the sample function like so

sample(c(1:103), size=31, replace = F)

8  85   5  83  66  46  39  75 101  94  10  68  63  74  22  86  42
59  52  97  62  11  44  96  88  28   9  36   2  78  49

but how do I put these specific rows into a new dataframe?

You can use the `sample()` function. – Duck, Jul 09 '20 at 14:40
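As a minimal sketch of what the comment suggests, the indices returned by `sample()` can be used directly to subset rows. The `listings` data frame and its `id` column are made up for illustration, with 103 rows to match the question. This is a plain random split, not a stratified one.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in for the real listings data (103 rows, as in the question)
listings <- data.frame(id = 1:103, PRICE = 300000 + 1000 * (1:103))

# Draw 31 distinct row numbers
idx <- sample(nrow(listings), size = 31, replace = FALSE)

test_data  <- listings[idx, ]   # the sampled rows
train_data <- listings[-idx, ]  # everything that was not sampled
```

The comma after the index matters: `listings[idx, ]` selects rows, whereas `listings[idx]` would try to select columns.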

Ric S · Accepted Answer · 2020-07-09T14:51:10.850

For stratified sampling you can use the createDataPartition function from the caret package, passing it the variable you want to stratify by (in your case ZIP). [[1]] selects the first element of the returned list, which contains the row indices needed for the split. With p = 0.7, about 70% of the rows go to the training set, leaving ~30% for testing. You then subset your original dataset: the rows given by train_index form the training set, and the remaining rows form the test set.

train_index <- caret::createDataPartition(your_data$ZIP, p = 0.7)[[1]]
train_data <- your_data[train_index,]
test_data <- your_data[-train_index,]
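If installing caret is not an option, the same kind of split can be sketched in base R. This is an illustrative alternative rather than what createDataPartition does internally, and the `listings` data frame below is invented for the example. Note also that, per the caret documentation, a numeric `y` is binned by percentiles rather than treated as classes, so a ZIP column stored as a number may need `factor(your_data$ZIP)` to be stratified per ZIP code.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical listings: 10 rows in each of two ZIP codes
listings <- data.frame(
  ADDRESS = paste(1:20, "Example St"),
  PRICE   = seq(300000, by = 5000, length.out = 20),
  ZIP     = rep(c("45670", "45678"), each = 10)
)

# Group the row numbers by ZIP, then draw ~30% of the rows within each group
rows_by_zip <- split(seq_len(nrow(listings)), listings$ZIP)
test_index <- unlist(lapply(rows_by_zip, function(rows) {
  n_test <- round(0.3 * length(rows))
  rows[sample.int(length(rows), n_test)]
}), use.names = FALSE)

test_data  <- listings[test_index, ]
# setdiff() is used instead of -test_index so that an empty sample
# does not silently produce an empty training set
train_data <- listings[setdiff(seq_len(nrow(listings)), test_index), ]
```

Sampling positions with `sample.int()` avoids the surprise that `sample(x, ...)` treats a single number `x` as the range `1:x`, which matters for ZIP codes with only one listing.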

score 2 · Answer 2 · answered Jul 09 '20 at 14:46


I believe the dplyr solution would be:

library(dplyr)
# Sample 30% of the rows within each ZIP group (the test set)
test_set <- df %>%
  group_by(ZIP) %>%
  sample_frac(0.3)

It returns a dataframe containing a random 30% sample of the rows from each ZIP group.
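The sampled rows are only one half of the split. One way to recover the remaining rows, sketched here under the assumption that a helper `row_id` column may be added, is to tag each row before sampling and then use `anti_join()`. The `listings` data and the `row_id` name are hypothetical.

```r
library(dplyr)

set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical listings: 10 rows in each of two ZIP codes
listings <- data.frame(
  ADDRESS = paste(1:20, "Example St"),
  PRICE   = seq(300000, by = 5000, length.out = 20),
  ZIP     = rep(c("45670", "45678"), each = 10)
)

# Tag every row so sampled rows can be identified afterwards
listings <- listings %>% mutate(row_id = row_number())

# ~30% of the rows from each ZIP
test_set <- listings %>%
  group_by(ZIP) %>%
  sample_frac(0.3) %>%
  ungroup()

# Every row whose row_id was not sampled
train_set <- anti_join(listings, test_set, by = "row_id")
```

Joining on an explicit row identifier keeps the split correct even when two listings happen to have identical values in every other column.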


Leonardo Diegues
