So I have this dataset with 18 variables and about 10,000,000 observations. The set is way too large for my computer to handle, so I need to take a smaller sample of the data to analyze it. However, I don't want just a random sample. One of my variables, "tip_level", is a factor with two levels, "high" and "low". Is there a way to take a sample of 100,000 observations where 50,000 are "high" and 50,000 are "low" on that variable?
where/how is the data set stored? do you have enough memory available to load the whole thing, if not to analyze it? – Ben Bolker Oct 24 '18 at 22:45
I don't know what it is specifically that you want to analyze, but if it's something over something, could it be more suited for SQL? – 12b345b6b78 Oct 24 '18 at 22:47
It's on my hard drive, and I'm able to load it into R just fine. Just doing anything with it takes forever to complete. – JareBear Oct 24 '18 at 22:49
1 Answer
Assuming you can load the data, how about something like
theseones <- c(sample(which(my_df$tip_level == "high"), 50000),
               sample(which(my_df$tip_level == "low"), 50000))
my_df[theseones, ]
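If you prefer a tidyverse approach, the same stratified draw can be sketched with dplyr's `slice_sample()`, which samples rows within each group. This is just an alternative sketch, assuming `my_df` and `tip_level` are as described in the question:

```r
library(dplyr)

# Sample 50,000 rows within each level of tip_level
my_sample <- my_df %>%
  group_by(tip_level) %>%
  slice_sample(n = 50000) %>%
  ungroup()
```

Either way, note that `sample()` and `slice_sample()` draw without replacement by default, so each level must actually contain at least 50,000 rows.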

Matt Tyers
@12b345b6b78 - this isn't just a random sample, it's stratified by each tip level as requested. – thelatemail Oct 24 '18 at 23:00