
I have a large dataset with over 100 million rows. I only want to read the part of the data that corresponds to one particular level of a factor, say `column1 == "A"`. How do I accomplish this in R using `read.csv`?

Thank you

user2145299
  • You can use `read.csv`'s `skip` and `nrows` parameters if you know where they are (and they're together). If you don't know that, some `grep` is probably in order. – alistaire Mar 29 '17 at 21:46
  • 1
    If you really want to keep it all in R, it's reasonably easy to read the file in in batches of rows (how many is practical depends on the memory available), with `lapply`, subsetting each to what you need, and combining the lot after the fact. You'd likely want to use `data.table::fread` or `readr::read_csv` for speed purposes, though, and it still won't be the fastest approach because it's doing a lot of excess processing. Optimizing it a little more wouldn't be that hard, though. – alistaire Mar 29 '17 at 23:28
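The batch-reading approach alistaire describes can be sketched as follows. This is a minimal sketch, not a tested recipe: the file name `big.csv`, the chunk size, and the column name `column1` are illustrative assumptions.

```r
library(data.table)

# Read a large CSV in fixed-size batches, keeping only rows where
# column1 == "A". File name, chunk size, and column name are assumptions.
read_filtered <- function(file, chunk_size = 1e6) {
  header <- names(fread(file, nrows = 0))  # column names from the header line
  skip <- 1                                # lines already consumed (the header)
  chunks <- list()
  repeat {
    chunk <- fread(file, skip = skip, nrows = chunk_size,
                   header = FALSE, col.names = header)
    if (nrow(chunk) > 0) {
      chunks[[length(chunks) + 1]] <- chunk[column1 == "A"]
    }
    if (nrow(chunk) < chunk_size) break    # a short read means end of file
    skip <- skip + nrow(chunk)
  }
  rbindlist(chunks)
}

subset_A <- read_filtered("big.csv")
```

Note this still scans the whole file; it only bounds memory use, which matches alistaire's caveat that it is not the fastest approach.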

1 Answer


You can't filter rows using `read.csv`. You might try `sqldf::read.csv.sql`, as outlined in answers to this question.
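A minimal sketch of that approach; the file name `big.csv` is an assumption, and `column1` comes from the question. `read.csv.sql` loads the file into a temporary SQLite database, so only matching rows ever reach R:

```r
library(sqldf)

# Filter while reading: only rows with column1 == 'A' are returned to R.
# The input must be referred to as `file` in the SQL statement.
df <- read.csv.sql("big.csv",
                   sql = "select * from file where column1 = 'A'")
```

Note that `read.csv.sql` does not strip quote characters from quoted fields, so an unquoted CSV is the easiest case.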

But I think most people would process the file using another tool first. For example, csvkit can filter rows by value before they ever reach R.
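For instance, with csvkit's `csvgrep`; the file and column names are assumptions from the question, and `-r` is used for an exact regex match since `-m` matches substrings:

```shell
# Keep the header plus rows whose column1 is exactly "A"
csvgrep -c column1 -r '^A$' big.csv > subset.csv

# Without csvkit, plain awk works when column1 is the first field
# and fields contain no embedded commas:
awk -F',' 'NR == 1 || $1 == "A"' big.csv > subset.csv
```

The much smaller `subset.csv` can then be read with `read.csv("subset.csv")` as usual.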

neilfws