So I have this dataset with 18 variables and about 10,000,000 observations. The set is way too large for my computer to handle, so I need to take a smaller sample of the data to analyze it. However, I don't want just a random sample. One of my variables, "tip_level", is a factor with two levels, "high" and "low". Is there a way to take a sample of 100,000 observations where 50,000 are "high" and 50,000 are "low" on that variable?
where/how is the data set stored? do you have enough memory available to load the whole thing, if not to analyze it? – Ben Bolker Oct 24 '18 at 22:45
I don't know what it is specifically that you want to analyze, but if it's something over something, could it be more suited for SQL? – 12b345b6b78 Oct 24 '18 at 22:47
It's on my hard drive, and I'm able to load it into R just fine. Just doing anything with it takes forever to complete. – JareBear Oct 24 '18 at 22:49
1 Answer
Assuming you can load the data, how about something like
theseones <- c(sample(which(my_df$tip_level == "high"), 50000),
               sample(which(my_df$tip_level == "low"), 50000))
my_df[theseones, ]
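If you prefer a tidyverse approach, the same stratified draw can be sketched with dplyr's `slice_sample()`, which samples rows within each group. This is just an alternative sketch, assuming `my_df` and `tip_level` are as described in the question:

```r
library(dplyr)

# Sample 50,000 rows within each level of tip_level
my_sample <- my_df %>%
  group_by(tip_level) %>%
  slice_sample(n = 50000) %>%
  ungroup()
```

Either way, note that `sample()` and `slice_sample()` draw without replacement by default, so each level must actually contain at least 50,000 rows.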

Matt Tyers
@12b345b6b78 - this isn't just a random sample, it's stratified by each tip level as requested. – thelatemail Oct 24 '18 at 23:00