So I have this dataset with 18 variables and about 10,000,000 observations. The set is way too large for my computer to handle, so I need to take a smaller sample of the data to analyze it. However, I don't want just a random sample. One of my variables, "tip_level", is a factor with two levels, "high" and "low". Is there a way to take a sample of 100,000 observations where 50,000 are "high" and 50,000 are "low" on that variable?

JareBear
  • where/how is the data set stored? do you have enough memory available to load the whole thing, if not to analyze it? – Ben Bolker Oct 24 '18 at 22:45
  • I don't know what it is specifically that you want to analyze, but if it's something over something, could it be more suited for SQL? – 12b345b6b78 Oct 24 '18 at 22:47
  • It's on my hard drive, and I'm able to load it into r just fine. Just doing anything with it takes forever to complete. – JareBear Oct 24 '18 at 22:49

1 Answer

Assuming you can load the data, how about something like:

theseones <- c(sample(which(my_df$tip_level=="high"), 50000), 
               sample(which(my_df$tip_level=="low"), 50000))
my_df[theseones,]
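To sanity-check the approach, here is a self-contained sketch on simulated data (`my_df` and its columns are stand-ins for the real dataset; the 1,000,000-row frame is just for illustration). It draws 50,000 row indices from each level of `tip_level` and confirms the subset is balanced:

```r
set.seed(42)  # for reproducibility of the sample

# Simulated stand-in for the real 10M-row dataset (assumption, not the OP's data)
my_df <- data.frame(tip_level = factor(sample(c("high", "low"), 1e6, replace = TRUE)),
                    tip       = runif(1e6))

# Sample 50,000 row indices (without replacement) from each level, then combine
idx <- unlist(lapply(c("high", "low"), function(lvl)
  sample(which(my_df$tip_level == lvl), 50000)))

sub_df <- my_df[idx, ]

table(sub_df$tip_level)  # 50000 in each level
```

The `which()` call restricts the candidate rows to one level at a time, so each `sample()` draws from exactly one stratum; this is a basic form of stratified sampling and generalizes to more levels by extending the vector passed to `lapply()`.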
Matt Tyers