
The fread function (data.table) enables users to read in specific columns of a data frame using the 'select' argument (e.g. fread(input, select=c(1,5,10))). I would like the same ability, but for rows (e.g. fread(input, selectrows=c(1,4,47))). I could subset after reading in the files, but that takes a very long time, and I hope to speed up the process by reading in only the rows I need.

I am aware of a number of options for selecting rows programmatically based on 'within-file' criteria:

Read csv file with selected rows using data.table's fread

Quickest way to read a subset of rows of a CSV

...but I want to be able to use a vector defined based on criteria outside of the given file to be read in (as in this question, but specifically using fread).

  • Another alternative might be [`vroom`](https://cran.r-project.org/web/packages/vroom/index.html), a fast CSV (and other format) reader that only reads lines that are needed. – r2evans Apr 21 '20 at 18:28
  • I've read through the documentation for vroom, but cannot find the argument where I can define my read-in rows. Can you clarify? – CharismaticChromoFauna Apr 21 '20 at 19:31
  • I think it was a hasty recommendation. Its real speed benefits are allegedly when the data contains many string columns. Sorry, I was excited too (it's new to me) :-) – r2evans Apr 21 '20 at 20:51

1 Answer


One method (although a little brute-force) is to use sed to extract the desired lines.

Recall that fread takes file= as well as cmd=, as in

library(data.table)
fwrite(iris, "iris.csv")
fread(cmd = "head -n 3 iris.csv")
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1:          5.1         3.5          1.4         0.2  setosa
# 2:          4.9         3.0          1.4         0.2  setosa

(Two rows since head doesn't know/care about the header row.)
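To see that header counting outside of R, the same head call can be run directly in a shell; the file here is a made-up three-row stand-in rather than the iris.csv above:

```shell
# Create a small CSV: one header line plus three data rows
printf 'x,y\n1,a\n2,b\n3,c\n' > demo.csv

# head -n 3 returns the first three *physical* lines:
# the header plus only the first two data rows
head -n 3 demo.csv
```

So any line numbers passed to a shell tool must be shifted by one to account for the header, which is what the next step does.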

Try this:

want_rows <- c(1, 3, 147:149)
# due to the header row, prepend 1 (the header line) and add 1 to each element of want_rows
paste0(c(1, 1+want_rows), "p")
# [1] "1p"   "2p"   "4p"   "148p" "149p" "150p"
writeLines(paste0(c(1, 1+want_rows), "p"), "commands.sed")

fread(cmd = "sed -n -f commands.sed iris.csv")
#    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# 1:          5.1         3.5          1.4         0.2    setosa
# 2:          4.7         3.2          1.3         0.2    setosa
# 3:          6.3         2.5          5.0         1.9 virginica
# 4:          6.5         3.0          5.2         2.0 virginica
# 5:          6.2         3.4          5.4         2.3 virginica
iris[want_rows,]
#     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# 1            5.1         3.5          1.4         0.2    setosa
# 3            4.7         3.2          1.3         0.2    setosa
# 147          6.3         2.5          5.0         1.9 virginica
# 148          6.5         3.0          5.2         2.0 virginica
# 149          6.2         3.4          5.4         2.3 virginica

If you have significant "ranges", then you could optimize this a little for sed, producing an effective command line of sed -ne '1p;2p;4p;148,150p' for the same effect.
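As a self-contained sketch of that range form (the data file here is a numbered stand-in for iris.csv, so line N+1 of the file holds data row N):

```shell
# Build a 150-row stand-in: a header line plus rows numbered 1-150
{ echo "row"; seq 1 150; } > data.csv

# Single-expression equivalent of the commands.sed file:
# line 1 is the header; lines 2 and 4 are data rows 1 and 3;
# lines 148-150 are data rows 147-149, collapsed into one range
sed -ne '1p;2p;4p;148,150p' data.csv
```

This prints six lines: the header plus data rows 1, 3, and 147 through 149, matching the commands.sed approach but without a temporary file.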

There is another method for "every so many rows" described here: https://www.thegeekstuff.com/2009/09/unix-sed-tutorial-printing-file-lines-using-address-and-patterns/. I don't know how tightly you can control this (every nth row starting from some arbitrary number, for instance). I don't know that this is your intent or need, though; it sounds like you will have arbitrary line numbers.
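If your sed is GNU sed, the "every so many rows" case is covered by its first~step address form; note this is a GNU extension, not POSIX, so it will not work with BSD/macOS sed:

```shell
# Numbered lines 1-10 as a stand-in data file
seq 1 10 > nums.txt

# GNU sed 'first~step' address: print line 2, then every 3rd line after it
sed -n '2~3p' nums.txt
# prints 2, 5 and 8 (on GNU sed)
```

That answers the "every nth row starting from some arbitrary number" question, at least for GNU sed, though it does not help with fully arbitrary line numbers.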

r2evans
  • I like the idea here and I've gotten it running---but somehow it is still faster (for my particular 5GB files, in which I'm using fewer than 10 rows) to read in the whole file, then subset the resulting matrix using the 'want_rows' vector (32 sec), compared to the 'sed' option (46 sec). However, your answer does exactly what I asked, so I'm going to mark this as the accepted answer! – CharismaticChromoFauna Apr 21 '20 at 20:40
  • It's an interesting problem. Granted, the better answer that you might need but not want or be able to implement is *"use a database"*. – r2evans Apr 21 '20 at 20:52