I want to read a CSV file (4.5 GB) in R and take a sample of 100,000 rows. This cannot be a simple head-of-file sample: the file contains 30 dates spread across a month, and each date has around 40k rows, so if I read a sample using the nrows parameter I get only 3 to 4 distinct dates. Below is my code:
train_data <- fread("C:\\Users\\Bala\\train.csv",
header = T, stringsAsFactors = F, nrows = 100000,
sep = ",")
This is the sample data:
> head(train_data)
id date store_nbr item_nbr unit_sales onpromotion
1: 0 2013-01-01 25 103665 7 NA
2: 1 2013-01-01 25 105574 1 NA
3: 2 2013-01-01 25 105575 2 NA
4: 3 2013-01-01 25 108079 1 NA
5: 4 2013-01-01 25 108701 1 NA
6: 5 2013-01-01 25 108786 3 NA
> table(train_data$date)
2013-01-01 2013-01-02 2013-01-03 2013-01-04
578 41676 40100 17646
My requirement: I want a sample of, say, 5000 records from each date (there are 30 dates in the file in total). That would give me a reasonable sample of the whole file; otherwise I could not draw a sound conclusion from any technique applied to my sample. I am not sure if this is possible. Kindly share your thoughts. Thanks.
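For reference, the kind of stratified sample I am after could, I think, be written with data.table like this. This is only a sketch: it assumes the full file fits in memory, and uses the column names shown in the head() output above.

```r
library(data.table)

# Read the whole file once (requires enough RAM for the 4.5 GB table)
train_data <- fread("C:\\Users\\Bala\\train.csv",
                    header = TRUE, stringsAsFactors = FALSE, sep = ",")

set.seed(42)  # make the sample reproducible

# For each date, draw at most 5000 rows at random;
# min(.N, 5000) guards against dates with fewer than 5000 rows
sample_data <- train_data[, .SD[sample(.N, min(.N, 5000))], by = date]

table(sample_data$date)  # inspect: roughly 5000 rows per date
```

But I don't know whether reading the entire file first is the right approach here, or whether there is a way to sample per date without loading everything.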