I have a large CSV file with many columns (10,000 rows x 125 columns). I would like to extract two columns from this file in R for further analysis. I'm already using fast packages to read the CSV (vroom and data.table's fread), but they seem to read the entire file and then drop the columns I'm not interested in. Do fast CSV readers like fread and vroom read all columns and then discard the ones I don't want, or do they read only the columns I select?

I'm asking specifically because the fread documentation says about the select argument: "A vector of column names or numbers to keep, drop the rest." The word "drop" makes me think that it first reads them and then drops them.

Here is the vroom code I'm currently using:

library(vroom)
system.time(vroom(path, col_select = c("column_1", "column_2")))
Rows: 10,000
Columns: 2
Delimiter: ","
chr [2]: column_1, column_2

Use `spec()` to retrieve the guessed column specification
Pass a specification to the `col_types` argument to quiet this message
   user  system elapsed 
  0.080   0.159   2.636 
Warning message:
In .Internal(gc(verbose, reset, full)) :
  closing unused connection 3 (path/to/file)

Here is the fread code with benchmark:

library(data.table)
system.time(fread(path, select = c("column_1", "column_2")))
   user  system elapsed 
  0.057   0.007   0.074 
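
One way to probe this empirically is to compare `fread` with `select` against reading every column and subsetting afterwards. The following is only a minimal sketch: the synthetic file, its column names, and the variables `wanted`, `only_two`, and `everything` are stand-ins I made up for illustration, not my real data.

```r
library(data.table)

# Hypothetical stand-in for the real file: 10,000 rows x 125 numeric columns.
set.seed(1)
n_rows <- 10000
n_cols <- 125
dt <- as.data.table(matrix(rnorm(n_rows * n_cols), nrow = n_rows))
setnames(dt, paste0("column_", seq_len(n_cols)))

path <- tempfile(fileext = ".csv")
fwrite(dt, path)

wanted <- c("column_1", "column_2")

# (a) Let fread handle the column selection.
t_select <- system.time(
  only_two <- fread(path, select = wanted)
)

# (b) Read every column, then subset afterwards.
t_full <- system.time({
  everything <- fread(path)
  subset_after <- everything[, wanted, with = FALSE]
})

# Both routes should yield the same two columns.
stopifnot(isTRUE(all.equal(only_two, subset_after)))

# Compare elapsed time and in-memory size of the two approaches.
cat("select:           ", t_select[["elapsed"]], "s\n")
cat("full then subset: ", t_full[["elapsed"]], "s\n")
print(object.size(only_two))
print(object.size(everything))

unlink(path)
```

This only measures the outcome (time and memory) and does not by itself prove what happens internally. Since CSV is row-oriented text, any reader still has to scan each line to locate the delimiters, so on a file this small the timing gap may be within noise. Repeating the runs or using a larger file would give a clearer picture.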
Tea Tree
  • What makes you think `fread` is reading all the columns and then dropping? – Gregor Thomas Oct 13 '20 at 17:35
  • I'm not really seeing anything to set this apart from this possible dupe: [Only read selected columns](https://stackoverflow.com/q/5788117/903061). – Gregor Thomas Oct 13 '20 at 17:36
  • In the documentation it says for the select argument: "A vector of column names or numbers to keep, drop the rest." The word "drop" makes me think that it first reads them and then drops them. – Tea Tree Oct 13 '20 at 17:46
  • I would be surprised if the highly optimized `data.table` reads the columns you tell it not to. Perhaps asking that as a question would be worth doing to confirm. The rest of your question, besides that implication about `fread`, seems like it is a dupe as I suggest above. – Gregor Thomas Oct 13 '20 at 17:58
  • That makes sense. Can I just ask this question on SO? – Tea Tree Oct 13 '20 at 18:13
  • Yeah. I would just edit this question into that. – Gregor Thomas Oct 13 '20 at 18:43
