
I have a large dataset with over 100 million rows. I only want to read the part of the data that corresponds to one particular level of a factor, say `column1 == "A"`. How do I accomplish this in R using `read.csv`?

Thank you

user2145299
  • You can use `read.csv`'s `skip` and `nrows` parameters if you know where they are (and they're together). If you don't know that, some `grep` is probably in order. – alistaire Mar 29 '17 at 21:46
  • 1
    If you really want to keep it all in R, it's reasonably easy to read the file in in batches of rows (how many is practical depends on the memory available), with `lapply`, subsetting each to what you need, and combining the lot after the fact. You'd likely want to use `data.table::fread` or `readr::read_csv` for speed purposes, though, and it still won't be the fastest approach because it's doing a lot of excess processing. Optimizing it a little more wouldn't be that hard, though. – alistaire Mar 29 '17 at 23:28
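The batch-reading approach alistaire describes can be sketched as follows. This is a minimal sketch, not a tested recipe: the file name `big.csv`, the chunk size, and the column name `column1` are illustrative assumptions.

```r
library(data.table)

# Read a large CSV in fixed-size batches, keeping only rows where
# column1 == "A". File name, chunk size, and column name are assumptions.
read_filtered <- function(file, chunk_size = 1e6) {
  header <- names(fread(file, nrows = 0))  # column names from the header line
  skip <- 1                                # lines already consumed (the header)
  chunks <- list()
  repeat {
    chunk <- fread(file, skip = skip, nrows = chunk_size,
                   header = FALSE, col.names = header)
    if (nrow(chunk) > 0) {
      chunks[[length(chunks) + 1]] <- chunk[column1 == "A"]
    }
    if (nrow(chunk) < chunk_size) break    # a short read means end of file
    skip <- skip + nrow(chunk)
  }
  rbindlist(chunks)
}

subset_A <- read_filtered("big.csv")
```

Note this still scans the whole file; it only bounds memory use, which matches alistaire's caveat that it is not the fastest approach.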

1 Answer


You can't filter rows using `read.csv`. You might try `sqldf::read.csv.sql`, as outlined in answers to this question.
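A minimal sketch of that approach; the file name `big.csv` is an assumption, and `column1` comes from the question. `read.csv.sql` loads the file into a temporary SQLite database, so only matching rows ever reach R:

```r
library(sqldf)

# Filter while reading: only rows with column1 == 'A' are returned to R.
# The input must be referred to as `file` in the SQL statement.
df <- read.csv.sql("big.csv",
                   sql = "select * from file where column1 = 'A'")
```

Note that `read.csv.sql` does not strip quote characters from quoted fields, so an unquoted CSV is the easiest case.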

But I think most people would process the file using another tool first. For example, csvkit can filter rows by value before they ever reach R.
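For instance, with csvkit's `csvgrep`; the file and column names are assumptions from the question, and `-r` is used for an exact regex match since `-m` matches substrings:

```shell
# Keep the header plus rows whose column1 is exactly "A"
csvgrep -c column1 -r '^A$' big.csv > subset.csv

# Without csvkit, plain awk works when column1 is the first field
# and fields contain no embedded commas:
awk -F',' 'NR == 1 || $1 == "A"' big.csv > subset.csv
```

The much smaller `subset.csv` can then be read with `read.csv("subset.csv")` as usual.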

neilfws