0

I'm trying to read only specific rows from several csv files. Each file has 20000 rows. However, I only want to select rows 350-850, 1350-1850, 2300-2850 et cetera, until 19350-19850 (so any value within the range *350-*850).

Creating a vector with the appropriate numbers doesn't seem possible:

Error in FUN(X[[i]], ...) : 
'nrows' must be a single non-NA number of type numeric or integer

Any suggestions?

  • 1
    To ask the obvious, would there be anything wrong with reading in all 20K rows and then subsetting the data frame? How many columns does it have? – Tim Biegeleisen Mar 07 '16 at 15:20

1 Answers1

4

This does not use fread but does read the file into an SQLite database in a single pass and then only sending the desired rows into R and no others.

It will first read the file into an SQLite database that it creates for you on the fly and will do so without going through R so that R limitations do not apply to that step. Then it will only read into R the indicated rows so that as long as the rows selected (as opposed to the entire file) can fit into R you should be OK. At the end the database is automatically destroyed.

Below, you may need other args to read.csv.sql such as header and sep depending on the file. See ?read.csv.sql and ?sqldf and note that there are differences on the arguments available relative to read.table arguments.

library(sqldf)

from <- seq(350, 19350, 1000)
where <- paste( sprintf("(rowid between %d and %d)", from, from+500), collapse = " or ")

fn$read.csv.sql("myfile.csv", sql = "select * from file where $where")

Note 1: You can check the sql statement like this:

sql <- fn$identity("select * from file where $where") # sql holds text of SQL command
read.csv.sql("myfile.csv", sql = sql)

Note 2: Here is a self contained example:

library(sqldf)

# write test file
write.table(data.frame(a = 1:20000), "myfile.csv", row.names = FALSE)

from <- seq(350, 19350, 1000)
where <- paste( sprintf("(rowid between %d and %d)", from, from+500), collapse = " or ")
DF <- fn$read.csv.sql("myfile.csv", sql = "select * from file where $where")
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Thanks, I forgot to install the package. I will do so now and try your solution. –  Mar 07 '16 at 16:17
  • If the third column is called X in the header then `"select X from file where ..."` only reads the third column. With mulitple files (1) you could read them individually and `rbind` them, or (2) append them all to one table in SQL and then read in whatever portion you want or (3) you could create separate SQL tables and then read in portions. Suggest you look through the examples on the sqldf home page (not specifically of this but it might get you more familiar with sql in general): https://github.com/ggrothendieck/sqldf – G. Grothendieck Mar 07 '16 at 16:26