
The fread function (data.table) enables users to read in specific columns of a data frame using the 'select' argument (e.g. fread(input, select=c(1,5,10))). I would like the same ability, but for rows (e.g. fread(input, selectrows=c(1,4,47))). I could subset after reading in the files, but that takes a very long time, and I hope to speed up the process by reading in only the rows I need.

I am aware of a number of options for selecting rows programmatically based on 'within-file' criteria:

Read csv file with selected rows using data.table's fread

Quickest way to read a subset of rows of a CSV

...but I want to be able to use a vector defined based on criteria outside of the given file to be read in (as in this question, but specifically using fread).

  • Another alternative might be [`vroom`](https://cran.r-project.org/web/packages/vroom/index.html), a fast CSV (and other format) reader that only reads lines that are needed. – r2evans Apr 21 '20 at 18:28
  • I've read through the documentation for vroom, but cannot find the argument where I can define my read-in rows. Can you clarify? – CharismaticChromoFauna Apr 21 '20 at 19:31
  • I think it was a hasty recommendation. Its real speed benefits are allegedly when the data contains many string columns. Sorry, I was excited too (it's new to me) :-) – r2evans Apr 21 '20 at 20:51

1 Answer


One method (although a little brute-force) is to use sed to extract the desired lines.

Recall that fread takes file= as well as cmd=, as in

library(data.table)
fwrite(iris, "iris.csv")
fread(cmd = "head -n 3 iris.csv")
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1:          5.1         3.5          1.4         0.2  setosa
# 2:          4.9         3.0          1.4         0.2  setosa

(Two rows since head doesn't know/care about the header row.)
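To see that header counting outside of R, the same head call can be run directly in a shell; the file here is a made-up three-row stand-in rather than the iris.csv above:

```shell
# Create a small CSV: one header line plus three data rows
printf 'x,y\n1,a\n2,b\n3,c\n' > demo.csv

# head -n 3 returns the first three *physical* lines:
# the header plus only the first two data rows
head -n 3 demo.csv
```

So any line numbers passed to a shell tool must be shifted by one to account for the header, which is what the next step does.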

Try this:

want_rows <- c(1, 3, 147:149)
# due to the header row, prepend 1 (the header line) and add 1 to each element of want_rows
paste0(c(1, 1+want_rows), "p")
# [1] "1p"   "2p"   "4p"   "148p" "149p" "150p"
writeLines(paste0(c(1, 1+want_rows), "p"), "commands.sed")

fread(cmd = "sed -n -f commands.sed iris.csv")
#    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# 1:          5.1         3.5          1.4         0.2    setosa
# 2:          4.7         3.2          1.3         0.2    setosa
# 3:          6.3         2.5          5.0         1.9 virginica
# 4:          6.5         3.0          5.2         2.0 virginica
# 5:          6.2         3.4          5.4         2.3 virginica
iris[want_rows,]
#     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# 1            5.1         3.5          1.4         0.2    setosa
# 3            4.7         3.2          1.3         0.2    setosa
# 147          6.3         2.5          5.0         1.9 virginica
# 148          6.5         3.0          5.2         2.0 virginica
# 149          6.2         3.4          5.4         2.3 virginica

If you have significant "ranges", then you could optimize this a little for sed, producing an effective command line of sed -ne '1p;2p;4p;148,150p' for the same effect.
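As a self-contained sketch of that range form (the data file here is a numbered stand-in for iris.csv, so line N+1 of the file holds data row N):

```shell
# Build a 150-row stand-in: a header line plus rows numbered 1-150
{ echo "row"; seq 1 150; } > data.csv

# Single-expression equivalent of the commands.sed file:
# line 1 is the header; lines 2 and 4 are data rows 1 and 3;
# lines 148-150 are data rows 147-149, collapsed into one range
sed -ne '1p;2p;4p;148,150p' data.csv
```

This prints six lines: the header plus data rows 1, 3, and 147 through 149, matching the commands.sed approach but without a temporary file.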

There is another method for "every so many rows" described here: https://www.thegeekstuff.com/2009/09/unix-sed-tutorial-printing-file-lines-using-address-and-patterns/. I don't know how tightly you can control this (every nth row starting from some arbitrary number, for instance). I don't know that this is your intent or need, though; it sounds like you will have arbitrary line numbers.
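If your sed is GNU sed, the "every so many rows" case is covered by its first~step address form; note this is a GNU extension, not POSIX, so it will not work with BSD/macOS sed:

```shell
# Numbered lines 1-10 as a stand-in data file
seq 1 10 > nums.txt

# GNU sed 'first~step' address: print line 2, then every 3rd line after it
sed -n '2~3p' nums.txt
# prints 2, 5 and 8 (on GNU sed)
```

That answers the "every nth row starting from some arbitrary number" question, at least for GNU sed, though it does not help with fully arbitrary line numbers.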

r2evans
  • I like the idea here and I've gotten it running---but somehow it is still faster (for my particular 5GB files, in which I'm using fewer than 10 rows) to read in the whole file, then subset the resulting matrix using the 'want_rows' vector (32 sec), compared to the 'sed' option (46 sec). However, your answer does exactly what I asked, so I'm going to mark this as the accepted answer! – CharismaticChromoFauna Apr 21 '20 at 20:40
  • It's an interesting problem. Granted, the better answer that you might need but not want or be able to implement is *"use a database"*. – r2evans Apr 21 '20 at 20:52