I want to read a CSV file (4.5 GB) in R and take a sample of 100,000 rows. This should not be just a random sample. The file contains 30 dates spread across a month. If I take a sample by passing nrows as a parameter, I get only 3 to 4 distinct dates, because each date contains around 40k rows. Below is my code:

library(data.table)

# nrows = 100000 reads only the first 100,000 rows of the file
train_data <- fread("C:\\Users\\Bala\\train.csv",
                    header = TRUE, stringsAsFactors = FALSE, nrows = 100000,
                    sep = ",")

This is the sample data:

> head(train_data)
   id       date store_nbr item_nbr unit_sales onpromotion
1:  0 2013-01-01        25   103665          7          NA
2:  1 2013-01-01        25   105574          1          NA
3:  2 2013-01-01        25   105575          2          NA
4:  3 2013-01-01        25   108079          1          NA
5:  4 2013-01-01        25   108701          1          NA
6:  5 2013-01-01        25   108786          3          NA
> table(train_data$date)

2013-01-01 2013-01-02 2013-01-03 2013-01-04 
       578      41676      40100      17646 

My requirement: I want a sample of, say, 5000 records from each date (30 dates in total in the file). That would give me a reasonable sample of the file. Otherwise, I cannot draw sound conclusions from any technique applied to my sample. I am not sure whether this is possible. Kindly share your thoughts. Thanks.

Bala
  • Sampling groups is not too complicated. You need to pick a grammar, though; `fread` returns a data.table by default, which has its own grammar apart from base R. – alistaire Oct 31 '17 at 04:10
  • Thank you. Could you please share a link or some more pointers on this? I am still not clear. – Bala Oct 31 '17 at 05:09
  • Here are some nice ideas: https://stackoverflow.com/questions/22261082/load-a-small-random-sample-from-a-large-csv-file-into-r-data-frame – Adiel Loinger Oct 31 '17 at 08:40
  • All the links mentioned above are about random samples. My question is not about a random sample. I have 30 dates in my dataset, and each date has around 40k rows. I want to extract exactly 5000 rows from each date, so a sample of 30 * 5000 = 150000 rows. Hope I am clear. – Bala Oct 31 '17 at 12:16
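
To illustrate the "sampling groups" idea from the first comment, here is a minimal sketch in data.table grammar. It is not a confirmed solution for the 4.5 GB file: it assumes the whole file (or at least the date column and the rows of interest) fits in memory once `nrows` is dropped from the `fread` call. The toy table `toy`, the variable `n_per_date`, and the result name `sampled` are hypothetical and only stand in for `train_data` and the 5000-rows-per-date requirement.

```r
library(data.table)

set.seed(42)  # make the sample reproducible

# Toy stand-in for train.csv (hypothetical data): 3 dates with unequal row counts
toy <- data.table(
  id         = 0:11,
  date       = rep(c("2013-01-01", "2013-01-02", "2013-01-03"), times = c(2, 5, 5)),
  unit_sales = 1:12
)

n_per_date <- 3  # would be 5000 for the real file

# For each date, draw up to n_per_date rows without replacement.
# min(.N, n_per_date) keeps dates that have fewer rows than requested.
sampled <- toy[, .SD[sample(.N, min(.N, n_per_date))], by = date]

# Row count per date in the sample
sampled[, .N, by = date]
```

In the toy data the first date has only 2 rows, so it contributes 2 rows while the other dates contribute 3 each. The same guard matters for the real data, where the `table()` output in the question shows 2013-01-01 with only 578 rows among the first 100,000.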

0 Answers