I have a large CSV file with many columns (dimensions 10k x 125), and I would like to extract two columns from it in R for further analysis. I'm already using very fast packages to read the file (vroom and data.table's fread), but they seem to read the entire file and then drop the columns I'm not interested in. Do fast CSV readers like fread and vroom read in all columns and then drop the ones I don't want, or do they read only the columns I select?
I'm asking specifically because the fread documentation describes the select argument as: "A vector of column names or numbers to keep, drop the rest." The word "drop" makes me think that it first reads all the columns and then drops the unwanted ones.
Here is the vroom code I'm currently using:
system.time(vroom(path, col_select = c("column_1", "column_2")))
Rows: 10,000
Columns: 2
Delimiter: ","
chr [2]: column_1, column_2
Use `spec()` to retrieve the guessed column specification
Pass a specification to the `col_types` argument to quiet this message
user system elapsed
0.080 0.159 2.636
Warning message:
In .Internal(gc(verbose, reset, full)) :
closing unused connection 3 (path/to/file)
Here is the fread code with benchmark:
system.time(fread(path, select = c("column_1", "column_2")))
user system elapsed
0.057 0.007 0.074
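For anyone who wants to reproduce the comparison, here is a minimal sketch, assuming data.table is installed. The synthetic data, the column names and the temporary file are made up to match the shape of my file (10,000 x 125); they are not my real data. It times fread with and without select on the same file. A timing difference alone would not prove what happens internally, but it gives a concrete baseline to discuss.

```r
# Sketch: time fread() on a synthetic 10,000 x 125 CSV,
# once reading every column and once with `select`.
library(data.table)

set.seed(1)
n_rows <- 10000
n_cols <- 125

# Hypothetical data: 125 numeric columns named column_1 ... column_125.
dt <- as.data.table(matrix(runif(n_rows * n_cols), nrow = n_rows))
setnames(dt, paste0("column_", seq_len(n_cols)))

# Write the table to a temporary CSV file.
path <- tempfile(fileext = ".csv")
fwrite(dt, path)

# Read all columns, then only the two columns of interest.
t_all <- system.time(full <- fread(path))
t_sel <- system.time(two <- fread(path, select = c("column_1", "column_2")))

# Elapsed seconds for each call (values will vary by machine).
print(c(all_columns = t_all[["elapsed"]], selected = t_sel[["elapsed"]]))

# Dimensions of the selected result.
print(dim(two))
```

Running each call several times and comparing the medians should give steadier numbers than a single system.time call, since one run on a file this small is dominated by noise.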