
I have 350 csv files with between 10,000 and 700,000 rows in each. I would like to read a subset of each file into R. My method so far is:

library(dplyr)

# all 350 csv files
to_load <- Sys.glob("data/*.csv")

# read each file and keep only the rows where condition == "a"
data <- data_frame(file = to_load) %>%
           rowwise() %>%
           do(read.csv(.$file) %>% filter(condition == "a"))

When I try this out with just the first 6 files, the estimated completion time from do() is 3 minutes, which extrapolates to about 3 hours for all 350 files. My question is whether there is a more efficient approach. I'm open to trying just about anything.

JoFrhwld
    Try `fread` in the data.table package. – G. Grothendieck Mar 01 '16 at 12:38
  • `data.table::fread` or `readr::read_csv` are better, but if you are still having problems then perhaps you should jump outside R and pre-filter those files with `awk`, add a sample ID column with `paste`, and `cat` them into one file. Then import. Only if it's a REALLY big job of course. – Stephen Henderson Mar 01 '16 at 12:46
  • [see here](http://stackoverflow.com/questions/32888757/reading-multiple-files-into-r-best-practice/32888918#32888918) for an example of how to read multiple files (in which you can use `fread` as well, of course) – Jaap Mar 01 '16 at 12:46
  • I'll check these out. It's been a bit since I've been on SO. Why are you all leaving these as comments and not answers? – JoFrhwld Mar 01 '16 at 12:54
  • @JoFrhwld If the answer I linked to solves your problem, there is no real need to (besides gaining reputation points). This question should then be marked as a duplicate imo. – Jaap Mar 01 '16 at 13:08
  • @JoFrhwld I gave alternative advice to the request; others would find it hard to demonstrate without creating lots of example files. You seem fairly competent, and they do point you to example code. – Stephen Henderson Mar 01 '16 at 13:08
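
A minimal sketch of the `fread`-based approach suggested in the comments, assuming the `data.table` package is installed and that every file has a `condition` column as in the question; the filter value `"a"` and the `file` source column mirror the code above:

library(data.table)

to_load <- Sys.glob("data/*.csv")

# fread() is usually much faster than read.csv(); filter each file as it is read,
# then stack the pieces with rbindlist(), keeping the source file name in a column
data <- rbindlist(
  lapply(to_load, function(f) {
    d <- fread(f)
    d <- d[condition == "a"]
    d[, file := f]
    d
  })
)

If only a few of the columns are needed, fread's `select` argument can cut the read time further.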

0 Answers