sub-setting without reading the whole data based on a column

Question

I have a big data which one of its columns is Date in "character" class. Is there any argument in different reading functions for reading the rows which are only in 1/13/2017 - 1/13/2018 for example? or at least what is the command to subsetting without reading the total data in any class?

thank you for your response

There's no indexing and any extraction is likely to make a temporary copy. Character vectors are effectively hashed but I doubt that will prevent copying. I'm of the opinion that your question is too vague for any further comment. — IRTFM, Jan 13 '18 at 16:36
Possible duplicate of [R: selecting subset without copying](https://stackoverflow.com/questions/9573055/r-selecting-subset-without-copying) — lbusett, Jan 13 '18 at 20:19
@lbusett IMHO, this is not a good dupe target as the linked question is talking about subsets of *objects* whereas the actual question is asking about subsets of files while reading. — Uwe, Jan 14 '18 at 07:45

G. Grothendieck · Accepted Answer · 2018-01-13T20:28:33.790

Suppose we have the test file generated in the Note at the end. We suppose that the actual file is much larger or else we could have just read it directly using read.csv. Instead we use read.csv.sql to only read in 2017 and 2018 into R. Then we use this much smaller data frame cutting it down to precisely the dates we want:

library(sqldf)

sql <- "select * from file where dates like '%2017' or dates like '%2018'"
dd <- read.csv.sql("testfile.csv", sql)
dd$dates <- as.Date(dd$dates, "%d/%m/%Y")
dd_sub <- subset(dd, dates > '2017-01-13' & dates <= '2018-01-13')

To show that it worked we display the first few and last few rows:

> head(dd_sub)
        dates value
14 2017-01-14   744
15 2017-01-15   745
16 2017-01-16   746
17 2017-01-17   747
18 2017-01-18   748
19 2017-01-19   749

> tail(dd_sub)
         dates value
365 2017-12-31  1095
366 2018-01-01  1096
367 2018-01-02  1097
368 2018-01-03  1098
369 2018-01-04  1099
370 2018-01-05  1100

The actual file you have may have different format than testfile.csv so you will need to carefully read ?read.csv.sql and set whatever arguments you need.

Note

# test input in reproducible form
dates <- format(as.Date("2015-01-01") + 1:1100, "%d/%m/%Y")
d <- data.frame(dates, value = seq_along(dates))
write.csv(d, "testfile.csv", row.names = FALSE, quote = FALSE)

sub-setting without reading the whole data based on a column

1 Answers1

Note