3

I have a large csv file and would like to read only certain lines, defined by a vector of row numbers to be read. Is there any way to read these rows without reading the whole csv into memory?

The only solutions I've found seem to allow reading consecutive lines (e.g. lines 2-100).

Rohan
  • 115
  • 2
  • 6
  • could see if `sqldf`helps: https://www.rdocumentation.org/packages/sqldf/versions/0.4-11/topics/read.csv.sql – Jonny Phelps Aug 13 '19 at 15:41
  • @JonnyPhelps Thanks, I thought about trying that but the rows are not easily subsetted from the data in the csv and was hoping to avoid adding a column to the spreadsheet based on whether a row should be selected. – Rohan Aug 13 '19 at 15:43
  • @joran yeah command line might be the way to go, I just thought there might be a quick way in R. – Rohan Aug 13 '19 at 15:47
  • could use appropriate combinations of `skip` and `nrows` to read individual rows (or small chunks), then glue the results together ... but probably won't perform very well. – Ben Bolker Aug 13 '19 at 17:18

2 Answers2

1

A simple example of how you might combine the sed approach I linked to into an R function:

read_rows <- function(file,rows,...){
  tmp <- tempfile()
  row_cmd <- paste(paste(rows,"p",sep = ""),collapse = ";")
  cmd <- sprintf(paste0("sed -n '",row_cmd,"' %s > %s"),file,tmp)
  system(command = cmd)
  read.table(file = tmp,...)
}

write.csv(x = mtcars,file = "~/Desktop/scratch/mtcars.csv")

    > read_rows(file = "~/Desktop/scratch/mtcars.csv",rows = c(3,6,7),sep = ",",header = FALSE,row.names = 1)
                    V2 V3  V4  V5   V6    V7    V8 V9 V10 V11 V12
Mazda RX4 Wag     21.0  6 160 110 3.90 2.875 17.02  0   1   4   4
Hornet Sportabout 18.7  8 360 175 3.15 3.440 17.02  0   0   3   2
Valiant           18.1  6 225 105 2.76 3.460 20.22  1   0   3   1
> read_rows(file = "~/Desktop/scratch/mtcars.csv",rows = c(1,5,9),sep = ",",header = TRUE,row.names = 1)
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2

Note the difference with row 1 as the column headers.

joran
  • 169,992
  • 32
  • 429
  • 468
1

sqldf will read it into a database (which it will create and then delete for you) and then read only the rows you want into R. Assuming the csv file created in the Note at the end define the desired Rows and then use read.csv.sql. We have used a temporary file for the database but if the data is sufficiently small you can omit the dbname argument and it will use memory.

library(sqldf)

Rows <- c(3, 5, 10)

s <- toString(Rows)
fn$read.csv.sql("Letters.csv", "select * from file where rowid in ($s)",
   dbname = tempfile())

giving:

     X Letters
1  "3"     "c"
2  "5"     "e"
3 "10"     "j"

If the number of rows desired is very large then rather than embedding the row numbers in the SQL statement create a data frame from them and join it:

library(sqldf)

Rows <- c(3, 5, 10)
RowsDF <- data.frame(Rows)

s <- toString(Rows)
fn$read.csv.sql("Letters.csv", 
  "select file.* from file join RowsDF on file.rowid = RowsDF.Rows",
  dbname = tempfile())

Note

Letters <- data.frame(Letters = letters, stringsAsFactors = FALSE)
write.csv(Letters, "Letters.csv")
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341