Read only rows with certain dates using fread in the data.table package (R)

Question

What I want to do is subset large .csv files by certain dates to extract certain years.

What I have done so far is read the whole .csv file using fread and then subset by date.

Below is an example (note that I have generated some example data rather than reading it in using fread):

# Example data.table created after reading in from fread
library("data.table")
DT <- data.table(seq(as.Date("1999-01-01"), as.Date("2009-01-01"), by="day"))
DT$Var <- sample(1000, size=nrow(DT), replace=TRUE)
colnames(DT) <- c("Date", "Var")

# subset to extract data for the year 2004
DT_2004 <- subset(DT, Date %in% as.Date("2004-01-01"):as.Date("2004-12-31"))

This works, but it requires me to read in the whole .csv file first, which is quite time-consuming for very large files. Is there a way to subset the .csv file within fread so that I only read in the dates I want?

Thank you.

Is reading the file in with `fread` and subsetting slower than the options recommended [here](http://stackoverflow.com/q/1727772/1270695)? — A5C1D2H2I1M1N2O1R2T1, Jan 06 '15 at 09:46
[This](http://stackoverflow.com/questions/27747426/how-to-efficiently-read-the-first-character-from-each-line-of-a-text-file) may also be worth a look — David Arenburg, Jan 06 '15 at 10:05

score 0 · Answer 1 · answered Jan 06 '15 at 19:00

I don't think fread is the problem. I've used fread with a 3.5 million line csv file and it wasn't slow! I think the slowness could be caused by using POSIXct dates. Try using IDate instead. Look up ?IDateTime in the data.table help. The Description states:

Date and time classes with integer storage for fast sorting and grouping. Still experimental!

But that's not a problem. For your specific question try the following.

DT <- data.table(Date = seq(as.IDate("1999-01-01"), as.IDate("2009-01-01"), by = "day"))
DT[, Var := sample(1000, size = nrow(DT), replace = TRUE)]   # ?`:=`
DT_2004 <- DT[between(Date, as.IDate("2004-01-01"), as.IDate("2004-12-31")), .SD]   # ?between

Maybe someone more knowledgeable will come along and rectify our misunderstandings! Until then I hope this is of some help.
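On the original question of filtering while reading: one possible approach is to let a shell tool drop the unwanted lines before fread parses anything. The sketch below is only an illustration under several assumptions: a Unix-like system with `grep` available, a data.table version recent enough to have the `cmd` argument of fread, and dates stored as `YYYY-MM-DD` text at the start of each line. The file path `csv_path` and the generated file are hypothetical stand-ins for the real .csv.

```r
library(data.table)

# Hypothetical example file: ten years of daily rows, dates written as YYYY-MM-DD.
csv_path <- tempfile(fileext = ".csv")
DT <- data.table(
  Date = as.IDate(seq(as.Date("1999-01-01"), as.Date("2009-01-01"), by = "day"))
)
DT[, Var := sample(1000, size = .N, replace = TRUE)]
fwrite(DT, csv_path)

# Assumption: grep is on the PATH. It keeps only lines starting with "2004-",
# so fread never parses the other years. The header line is filtered out too,
# which is why the column names are supplied explicitly.
DT_2004 <- fread(
  cmd = paste("grep '^2004-'", shQuote(csv_path)),
  header = FALSE,
  col.names = c("Date", "Var")
)

# Make sure the column is an IDate regardless of how fread typed it.
DT_2004[, Date := as.IDate(Date)]
```

Whether this is actually faster than reading everything and subsetting depends on the file and the disk, since grep still has to scan every line; it mainly saves the parsing and memory cost of the rows that are thrown away.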

Thanks for the reply. I do agree that fread is fast at reading in the data compared to other methods. It still takes around 5-10 minutes to read my datasets in with fread though, so I was just wondering if the dates I wanted to read could be specified within the fread call. I'll give your method of selecting the years a go. — Catchment_Jack, Jan 09 '15 at 12:00
